Abstract

The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic
multi-armed bandit problem in decision theory. In its ultimate generality, the restless bandit problem
is known to be PSPACE-Hard to approximate to any non-trivial factor, and little progress has been made 
on this problem despite its significance in modeling activity allocation under uncertainty. 

We consider a special case that we call FEEDBACK MAB, where the reward obtained by playing 
each of n independent arms varies according to an underlying on/off Markov process whose exact state 
is only revealed when the arm is played. The goal is to design a policy for playing the arms in order 
to maximize the infinite horizon time average expected reward. This problem is also an instance of a 
Partially Observable Markov Decision Process (POMDP), and is widely studied in wireless scheduling 
and unmanned aerial vehicle (UAV) routing. Unlike the stochastic MAB problem, the FEEDBACK MAB
problem does not admit greedy index-based optimal policies. The state of the system at any time step
encodes the beliefs about the states of different arms, and the policy decisions change these beliefs; this
aspect complicates the design and analysis of simple algorithms.

We develop a novel and fairly general duality-based algorithmic technique that yields a surprisingly 
simple and intuitive $(2 + \epsilon)$-approximate greedy policy for this problem. We then define a general sub-class
of restless bandit problems that we term MONOTONE bandits, for which our policy is a 2-approximation.
Our technique is robust enough to handle generalizations of these problems to incorporate various
side-constraints such as blocking plays and switching costs. This technique is also of independent interest for
other restless bandit problems, and we provide an example in non-preemptive machine replenishment. 
We finally show that our policies are closely related to the Whittle index that is widely used for its 
simplicity and efficiency of computation. In fact, not only is our policy just as efficient to compute as 
the Whittle index, but in addition, it provides surprisingly strong constant factor guarantees even in cases 
where the Whittle index is provably polynomially worse. 

By presenting the first (and efficient) $O(1)$ approximations for non-trivial instances of restless bandits
as well as of POMDPs, our work initiates the study of approximation algorithms in both these contexts. 
1 Introduction



The celebrated multi-armed bandit problem (MAB) models the central trade-off in decision theory between 
exploration and exploitation, or in other words between learning about the state of a system and utilizing the 
system. In this problem, there are n competing options, referred to as "arms," yielding unknown rewards 
$\{r_i\}$. Playing an arm yields a reward drawn from an underlying distribution, and the information from the
reward observed partially resolves its distribution. The goal is to sequentially play the arms in order to 
maximize reward obtained over some time horizon. 

Typically, the multi-armed bandit problem is studied under one of two assumptions: 

1. The underlying reward distribution for each arm is fixed but unknown, and a prior of this distribution
is specified as input (stochastic multi-armed bandits ElIIOlllllllSl); or 

2. The underlying rewards can vary with time in an adversarial fashion, and the comparison is against
an optimal strategy that always plays one arm, albeit with the benefit of hindsight (adversarial
multi-armed bandits [7, 15, 17]).

Relaxing both the assumptions simultaneously leads to the notorious restless bandit problem in decision
theory, which, in its ultimate generality, is PSPACE-hard to even approximate [41]. In the last two decades,
in spite of the growth of approximation algorithms and the numerous applications of restless bandits
[3, 19, 20, 21, 37, 40, 13, 49, 52], the approximability of these problems has remained unexplored. In this paper, we provide
a general algorithmic technique that yields the first $O(1)$ approximations to a large class of these problems
that are commonly studied in practice.

An important subclass of restless bandit problems consists of situations where the system is agnostic of the
exploration: the exploration gives us information about the state of the system but does not interfere with
the evolution of the system. One such problem is the FEEDBACK MAB, which models opportunistic
multi-channel access at a wireless node [25, 30]: The bandit corresponds to a wireless node with access to
multiple noisy channels (arms). The state of the arm is the state (good/bad) of the channel, which varies 
according to a bursty 2-state Markov process. Playing the arm corresponds to transmitting on the channel, 
yielding reward if the transmission is successful (good channel state), and at the same time revealing to the 
transmitter the current state of the channel. This corresponds to the Gilbert-Elliot model [33] of channel 
evolution. The goal is to find a transmission policy of choosing one channel to transmit on every time step, 
that maximizes the long-term transmission rate. FEEDBACK MAB also models Unmanned Aerial Vehicle
(UAV) routing \^\: the arms are locations of possibly interesting events, and whether a location is
interesting or uninteresting follows a 2-state Markov process. Visiting a location by the UAV corresponds to
playing the arm, and yields reward if an interesting event is detected. The goal is to find a routing policy 
that maximizes the long-term average reward from interesting events. 

This problem is also a special case of Partially Observable Markov Decision Processes or POMDPs ||3T1 
ISlll5 |. The state of each arm evolves according to a Markov chain whose state is only observed when the 
arm is played. The player's partial information, encapsulated by the last observed state and the number of 
steps since last playing, yields a belief on the current state. (This belief is simply a probability distribution 
for the arm being good or bad.) The player uses this partial information in making the decision about which 
arm to play next, which in turn affects the information at future times. While such POMDPs are widely 
used in control theory, they are in general notoriously intractable [10, 31]. In this paper we provide the first
$O(1)$ approximation for the FEEDBACK MAB and a number of its important extensions. This represents the
first approximation guarantee for a POMDP, and the first guarantee for a MAB problem with time-varying 
rewards that compares to an optimal solution allowed to switch arms at will. 

Before we present the problem statements formally, we survey literature on the stochastic multi-armed 
bandit problem. (We discuss adversarial MAB after we present our model and results.) 
1.1 Background: Stochastic MAB and Restless Bandits

The stochastic MAB was first formulated by Arrow et al. [2] and Robbins [42]. It resides under a Bayesian (or
decision theoretic) setting: we successively choose between several options given some prior information 
(specified by distributions), and our beliefs are updated via Bayes' rule conditioned on the results of our 
choices (observed rewards). 

More formally, we are given a "bandit" with $n$ independent arms. Each arm $i$ can be in one of several
states belonging to the set $S_i$. At any time step, the player can play one arm. If arm $i$ in state $k \in S_i$
is played, it transitions in a Markovian fashion to state $j \in S_i$ with probability $q^i_{kj}$, and yields reward $r^i_k \ge 0$. The
states of arms that are not played stay the same. The initial state models the prior knowledge about the
arm. The states in general capture the posterior conditioned on the observations from sequential plays. The
problem is, given the initial states of the arms, to find a policy for playing the arms in order to maximize one of
the following infinite horizon quantities: $\sum_{t \ge 0} \beta^t R_t$ (discounted reward), or $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} R_t$ (average
reward), where $R_t$ is the expected reward of the policy at time step $t$ and $\beta \in (0, 1)$ is a discount factor. A
policy is a (possibly implicit) specification of fixing up front which arm (or distribution over arms) to play
for every possible joint state of the arms.

It is well-known that Bellman's equations [10] yield the optimal policy by dynamic programming. The
main issue in the stochastic setting is in efficiently computing and succinctly specifying the optimal policy: 
The input to an algorithm specifies the rewards and transition probabilities for each arm, and thus has size 
linear in n, but the state space is exponential in n. We seek polynomial-time algorithms (in terms of the 
input size) that compute (near-) optimal policies with poly-size specifications. Moreover, we require the 
policies to be executable each step in poly-time. 

Note that since a policy is a fixed (possibly randomized) mapping from the exponential size joint state
space to a set of actions, ensuring poly-time computation and execution often requires simplifying the
description of the optimal policy using the problem structure. The stochastic MAB problem is the most
well-known decision problem for which such a structure is known: The optimal policy is a greedy policy termed
the GITTINS index policy [18, 46, 10]. In general, an index policy specifies a single number called "index"
for each state $k \in S_i$ for each arm $i$, and at every time step, plays the arm whose current state has the highest
index. Index policies are desirable since they can be compactly represented, so they are the heuristic method
of choice for several MDP problems. In addition, index policies are also optimal for several generalizations
of the stochastic MAB, such as arm-acquiring bandits [51] and branching bandits [50]. In fact, a general
characterization of problems for which index policies are optimal is now known [12].
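
The mechanics of an index policy can be sketched in a few lines. This is only an illustration of the definition above, not the Gittins computation itself: the function name `index_policy_step` and the `index` callback are hypothetical, and how the index values are obtained is left entirely to the caller.

```python
from typing import Callable, Hashable, Sequence


def index_policy_step(states: Sequence[Hashable],
                      index: Callable[[int, Hashable], float]) -> int:
    """Return the arm whose current state has the highest index.

    states[i] is the current state of arm i, and index(i, k) is the number
    assigned to state k of arm i. Ties go to the lowest-numbered arm.
    """
    return max(range(len(states)), key=lambda i: index(i, states[i]))
```

The policy is specified by one number per (arm, state) pair, so its description is linear in the total number of single-arm states even though the joint state space is exponential in $n$.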

Restless Bandits. In the stochastic MAB problem, the underlying reward distributions for each arm are 
fixed but unknown. However, if the rewards can vary with time, the problem stops admitting optimal index 
policies or efficient solutions. The problem now needs to be modeled as a restless bandit problem, first 
proposed by Whittle [52]. The problem statement of the restless bandits is similar to stochastic MAB,
except that when arm $i$ in state $k \in S_i$ is not played, its state evolves to $j \in S_i$ with probability $\tilde{q}^i_{kj}$.
Therefore, the state of each arm varies according to an active transition matrix $q$ when the arm is played,
and according to a passive transition matrix $\tilde{q}$ if the arm is not played. Unlike the stochastic MAB problem,
which is interesting only in the discounted reward setting^1, the restless bandit problem is interesting even
in the infinite horizon average reward setting; this is the setting in which the problem has been typically
studied, and so we limit ourselves to this setting in this paper. It is relatively straightforward to show that
no index policy can be optimal for these problems; in fact, Papadimitriou and Tsitsiklis [41] show that for $n$
arms, even when all $q$ and $\tilde{q}$ values are either 0 or 1 (deterministic transitions), computing the optimal policy
is a PSPACE-hard problem. Their proof in fact shows that deciding if the optimal reward is non-zero is also
PSPACE-hard, hence ruling out any approximation algorithm as well.

^1 Playing the arm with the highest long-term average reward exclusively is the trivial optimal policy for stochastic MAB in the
infinite-horizon average reward setting.

On the positive side, Whittle [52] presents a poly-size LP relaxation of the problem. In this relaxation,
the constraint that exactly one arm is played per time step is replaced by the constraint that one arm on
average is played per time step. In the LP, this is the only constraint connecting the arms. (Such decision
problems have been termed weakly coupled systems 11291 m.) Based on the Lagrangean of this relaxation,
Whittle [52] defines a heuristic index that generalizes the Gittins index. This is termed the Whittle Index (see
Section 3). Though this index is widely used in practice and has excellent empirical performance [3, 19, 20,
21, 37, 40, 49], the known theoretical guarantees ([49, 19]) are very weak. In summary, despite being very
well-motivated and extensively studied, there are almost no positive results on approximation guarantees for 
the restless bandit problems. 

1.2 Results and Roadmap 

We provide the first approximation algorithm for both a restless bandit problem and a partially observable 
Markov decision problem by providing a $(2 + \epsilon)$-approximate index policy for the FEEDBACK MAB problem,
which belongs to both classes. We show several other results; however, before presenting the specifics, we 
place our contribution in the context of existing techniques in control theory. 

Technical Contributions. Our algorithmic technique for this problem (Section 2) involves solving (in
polynomial time) the Lagrangean of Whittle's LP relaxation for a suitable (and subtle) "balanced" choice of
the Lagrange multiplier, converting this into a feasible index policy, and using an amortized accounting of
the reward for the analysis. We show that this technique is closely related to the Whittle index [52, 40, 30],
and in fact, provide the first approximation analysis of (a subtle variant of) the Whittle index, which is widely
used in control theory literature in the context of FEEDBACK MAB problems (Section 3). We believe that
interest in analyzing the performance guarantees of the numerous indices used in the literature will increase,
and that our analysis will provide a useful template.

However, the key difference between Whittle's index and our index policy is the following: The former 
chooses one Lagrange multiplier (or index) per state of each arm, with the policy playing the arm with the 
largest index. This has the advantage of separate efficient computations for different arms; and in addition, 
such a policy (the Gittins index policy [18]) is known to be optimal for the stochastic MAB. However, it is
well-known [4, 8, 14] that this intuition about playing the arm with the largest index being optimal becomes
increasingly invalid when complicated side-constraints such as time-varying rewards (FEEDBACK MAB),
blocking plays, and switching costs are introduced. In fact, we show a concrete problem in Section 8 where
the Whittle index has an $\Omega(n)$ performance gap.

In contrast to the Whittle index, our technique chooses a single global Lagrange multiplier via a careful
accounting of the reward, and develops a feasible policy from it. Unlike the Whittle index, this technique
is sufficiently robust to encompass a large number of often-used variants of FEEDBACK MAB problems:
Plays with varying duration (Section 5), switching costs (Section 6), and observation costs (Section 7). In
fact, we identify a general MONOTONE condition in restless bandit problems under which our technique
applies (Section 4). Furthermore, our technique provides $O(1)$ approximations to other classic restless
bandit problems even when Whittle's index is polynomially sub-optimal: We show an example in the
non-preemptive machine replenishment problem (Section 8). Finally, since our technique is based on solving the
Lagrangean^2 (just like the Whittle index), the computation time is comparable to that for such indices.

In summary, our technique succeeds in finding the first provably approximate policies for widely-studied
control problems, without sacrificing efficiency in the process. We believe that the generality of this technique
will be useful for exploring other useful variations of these problems as well as providing an alternate
algorithm for practitioners.

^2 This aspect is explicit in Sections 2 and 3. In Sections 4-8, however, we have presented our algorithm in terms of first solving
a linear program. It is easy to see that this is equivalent to solving the Lagrangean, and hence to the computation required
for Whittle's index. The details are quite standard and can be reconstructed from those in Sections 2 and 3.

Specific Results. In terms of specific results, the paper is organized as follows: 

• We begin by presenting a $(2 + \epsilon)$-approximation for FEEDBACK bandits in Section 2. We also provide
an $e/(e - 1)$ integrality gap instance showing that our analysis is nearly tight.

• In Section 3, we show that our analysis technique can be used to prove that a thresholded variant of
the Whittle index is a 2-approximation. We also show instances where the reward of any index policy
is at least a $1 + \Omega(1)$ factor from the reward of the optimal policy. Therefore, although the Whittle index
is not optimal, our result sheds light on its observed superior performance in this specific context.

• In Section 4, we generalize the result in Section 2 to define a general sub-class of restless bandit
problems based on a critical set of properties: Separability and monotonicity. For this subclass,
termed MONOTONE bandits (which generalizes FEEDBACK MAB), we provide a 2-approximation
by generalizing the technique in Section 2. Our technique now introduces a balance constraint in the
dual of the natural LP relaxation, and constructs the index policy from the optimal dual solution. We
further show that in the absence of monotonicity or separability, the problem is either NP-Hard to
approximate, or has unbounded integrality gap respectively.

• In Section 5, we extend FEEDBACK MAB (as well as MONOTONE bandits) to consider multiple
simultaneous blocking plays of varying durations.

• In Section 6, we extend FEEDBACK MAB (and MONOTONE bandits) to consider switching costs.

• In Section 7, we extend FEEDBACK MAB to a variant where the information acquisition is varied,
namely, an arm has to be explicitly probed at some cost to obtain its state.

• In Section 8, we derive a 2-approximation for a classic restless bandit problem called non-preemptive
machine replenishment 1 1^12311391 . We also show that the Whittle Index for this problem has an
$\Omega(n)$ factor worse performance compared to the optimal policy. Thus the technique introduced in this paper
can be superior to Whittle index or similar policies.

1.3 Related Work 

Contrast with the Adversarial MAB Problem. While our problem formulations are based on the
stochastic MAB problem, one might be interested in a formulation based on the adversarial MAB liiiMJ-. Such
a formulation might be to assume that rewards can vary adversarially, and that the objective is to compete
with a restricted optimal solution that always plays the same arm but with the benefit of hindsight.

These different formulations result in fundamentally different problems. Under our formulation, the
difficulty is computational: we want to compute policies for playing the arms, assuming stochastic models
of how the system varies with time. Under the adversarial formulation, the difficulty is informational: we
would be interested in the regret of not having the benefit of hindsight. A sequence of papers show near-tight
regret bounds in fairly general settings [5, 6, 7, 15, 17, 34, 36]. However, applying this framework is not
satisfying: It is straightforward to show that a policy for FEEDBACK MAB that is allowed to switch arms can 
be times better than a policy that is not allowed to do so (even assuming hindsight). Another approach 
would be to define each policy as an "expert", and use the low-regret experts algorithm [15]; however,
the number of policies is super-exponentially large, which would lead to weak regret bounds, along with 
exponential-size policy descriptions and exponential per-step execution time. 
We note that developing regret bounds in the presence of changing environments has received
significant interest recently in computational learning ||5l Ul HSl |32l |43l ; however, this direction requires strong
assumptions such as bounded switching between arms [5] and slowly varying environments [32, 43], both
of which are inapplicable to FEEDBACK MAB. In independent work, Slivkins and Upfal [43]
consider the modification of FEEDBACK MAB where the underlying states of the arms vary according to a
reflected Brownian motion with bounded variance. As discussed in [43], this problem is technically very
different from ours, even requiring different performance metrics.

Other Related Work. The results in l^lll |22l |26l |23 consider variants of the stochastic MAB where the 
underlying reward distribution does not change and only a limited time is allotted to learning about this 
environment. Although several of these results use LP rounding, they have little connection to the duality 
based framework considered here. 

Our duality based framework shows a 2-approximate index policy for non-preemptive machine
replenishment (Section 8). Elsewhere, Munagala and Shi [39] considered the special case of preemptive machine
replenishment problem, for which the Whittle index is equivalent to a simple greedy scheme. They show
that this greedy policy, though not optimal, is a 1.51 approximation. However, the techniques there are
based on queuing analysis, and do not extend to the non-preemptive case where the Whittle index can be an
arbitrarily poor approximation (as shown in Section 8).

Our solution technique differs from primal-dual approximation algorithms BTll and online algorithms [53],
which relax either the primal or the dual complementary slackness conditions using a careful dual-growing
procedure. Our index policy and associated potential function analysis crucially exploit the structure of the
optimal dual solution that is gleaned using both the exact primal as well as dual complementary slackness
conditions. Furthermore, our notion of dual balancing is very different from that used by Levi et al. [35] for
designing online algorithms for stochastic inventory management. 

2 The Feedback MAB Problem 

In this problem, first formulated independently in [25, 54, 30, 40], there is a bandit with $n$ independent arms.
Arm $i$ has two states: The good state $g_i$ yields reward $r_i$, and the bad state $b_i$ yields no reward. The evolution
of state of the arm follows a bursty 2-state Markov process which does not depend on whether the arm is
played or not at a time slot. Let $s_{it}$ denote the state of arm $i$ at time $t$. Denote the transition probabilities
of the Markov chain as follows: $\Pr[s_{i(t+1)} = g_i \mid s_{it} = b_i] = \alpha_i$ and $\Pr[s_{i(t+1)} = b_i \mid s_{it} = g_i] = \beta_i$. The
$\alpha_i, \beta_i, r_i$ values are specified as input. The "burstiness" assumption simply means $\alpha_i + \beta_i < 1 - \delta$ for
some small $\delta > 0$ specified as part of the input. The states of different arms evolve independently.
Any policy chooses at most one arm to play every time slot. Each play is of unit duration, yields reward 
depending on the state of the arm, and reveals to the policy the current state of that arm. When an arm is 
not played, the true underlying state cannot be observed, which makes the problem a POMDP. The goal is 
to find a policy to play the arms in order to maximize the infinite horizon average reward. 

First observe that we can change the reward structure of FEEDBACK MAB so that when an arm is played, 
we obtain reward from the last-observed state instead of the currently observed state. This does not change 
the average reward of any policy. This allows us to encode all the state of each arm as follows. 

Proposition 2.1. From the perspective of any policy, the state of any arm can be encoded as $(s, t)$, which
denotes that it was last observed $t \ge 1$ steps ago to be in state $s \in \{g_i, b_i\}$.
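
To make the model and the $(s, t)$ encoding concrete, here is a minimal simulation sketch. It is an illustration under stated assumptions, not part of the paper: the function `simulate`, the callback `choose_arm`, and the initial condition (every arm was observed bad one step before the first slot) are all hypothetical choices, and the reward is credited from the currently observed state.

```python
import random


def simulate(alphas, betas, rewards, choose_arm, horizon, seed=0):
    """Average reward of a policy over `horizon` time slots.

    choose_arm(encoded) receives one (last_state, age) pair per arm, where
    last_state is "g" or "b" and age >= 1 counts steps since the arm was
    last observed, and returns the number of the arm to play.
    """
    rng = random.Random(seed)
    n = len(alphas)
    # Assumed start: each arm was seen in the bad state one step ago, so it
    # is good now with probability alpha.
    good = [rng.random() < alphas[j] for j in range(n)]
    encoded = [("b", 1)] * n
    total = 0.0
    for _ in range(horizon):
        i = choose_arm(encoded)
        if good[i]:
            total += rewards[i]
        observed = "g" if good[i] else "b"   # a play reveals the state
        # Every arm evolves, whether or not it was played.
        for j in range(n):
            if good[j]:
                if rng.random() < betas[j]:
                    good[j] = False
            elif rng.random() < alphas[j]:
                good[j] = True
        # The played arm resets to age 1; all other arms get one step older.
        encoded = [(observed, 1) if j == i else (s, age + 1)
                   for j, (s, age) in enumerate(encoded)]
    return total / horizon
```

A policy here is any function of the encoded states, for example `lambda enc: 0` to always play arm 0. The policy never sees the list `good`, which is what makes the problem partially observable.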

Note that any policy maps each possible joint state of n arms into an action of which arm to play. Such 
a mapping has size exponential in n. The standard heuristic is to consider index policies: Policies which 
define an "index" or number for each state $(s, t)$ and play the arm with the highest current index. The
following theorem shows that playing the arm with the highest myopic reward does not work, and that index
policies in general are non-optimal. Therefore, our problem is interesting and the best we can hope for with
index policies is an $O(1)$ approximation.

Theorem 2.1. (Proved in the Appendix.) For FEEDBACK MAB, the reward of the optimal policy has an $\Omega(n)$
gap against that of the myopic index policy and an $\Omega(1)$ gap against that of the optimal index policy.



Roadmap. In this section, we show that a simple index policy is a $(2 + \epsilon)$ approximation. This is based
on a natural LP relaxation suggested by Whittle, which we discuss in Section 2.1; this formulation will
have infinitely many constraints. We then consider the Lagrangean of this formulation in Section 2.2 and
analyze its structure via duality, which enables computing its optimal solution in polynomial time. At this
point, we deviate significantly from previous literature, and present our main contribution in Section 2.3: a
subtle and powerful "balanced" choice of the Lagrange multiplier, which enables the design of an intuitive
index policy, BalancedIndex, along with an equally intuitive analysis. We use duality and potential
function arguments to show that the policy is a $(2 + \epsilon)$ approximation. We conclude by showing that the gap
of Whittle's relaxation is $e/(e - 1) \approx 1.58$, indicating that our analysis is reasonably tight. This analysis
technique generalizes easily (explored in Sections 4-8) and has rich connections to other index policies,
most notably the Whittle index (explored in Section 3).



2.1 Whittle's LP 

Whittle's LP is obtained by effectively replacing the hard constraint of playing one arm per time step, with 
allowing multiple plays per step but requiring one play per step on average. Hence, the LP is a relaxation 
of the optimal policy. 

Definition 1. Let $v_{it}$ be the probability of the arm $i$ being in state $g_i$ when it was last observed in state $b_i$
exactly $t$ steps ago. Let $u_{it}$ be the same probability when the last observed state was $g_i$. We have:

\[
v_{it} = \frac{\alpha_i}{\alpha_i + \beta_i} \left( 1 - (1 - \alpha_i - \beta_i)^t \right)
\qquad
u_{it} = \frac{\alpha_i}{\alpha_i + \beta_i} + \frac{\beta_i}{\alpha_i + \beta_i} (1 - \alpha_i - \beta_i)^t
\]

Fact 2.2. The functions $v_{it}$ and $1 - u_{it}$ are monotonically increasing and concave functions of $t$.
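
The closed forms in Definition 1 can be checked numerically against the one-step belief update $p \mapsto p(1 - \beta_i) + (1 - p)\alpha_i$. The sketch below does exactly that; the function names `v`, `u`, and `belief_by_iteration` are illustrative choices rather than notation from the paper.

```python
def v(alpha, beta, t):
    """Pr[arm is good now | it was observed bad t steps ago]."""
    return alpha / (alpha + beta) * (1.0 - (1.0 - alpha - beta) ** t)


def u(alpha, beta, t):
    """Pr[arm is good now | it was observed good t steps ago]."""
    stationary = alpha / (alpha + beta)
    return stationary + beta / (alpha + beta) * (1.0 - alpha - beta) ** t


def belief_by_iteration(alpha, beta, start_good, t):
    """Same probability, obtained by applying the chain t times."""
    p = 1.0 if start_good else 0.0
    for _ in range(t):
        # good stays good w.p. 1 - beta; bad turns good w.p. alpha
        p = p * (1.0 - beta) + (1.0 - p) * alpha
    return p
```

In particular $v_{i1} = \alpha_i$ and $u_{i1} = 1 - \beta_i$, and both sequences approach the stationary probability $\alpha_i / (\alpha_i + \beta_i)$ as $t$ grows, $v_{it}$ from below and $u_{it}$ from above.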

We now present Whittle's LP, and interpret it in the lemma that immediately follows. 

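\[
\begin{array}{rcll}
\text{Maximize} & \multicolumn{2}{l}{\sum_{i=1}^{n} \sum_{t \ge 1} r_i x^i_{gt}} & \\
\sum_{i=1}^{n} \sum_{s \in \{g,b\}} \sum_{t \ge 1} x^i_{st} & \le & 1 & \\
\sum_{s \in \{g,b\}} \sum_{t \ge 1} \left( x^i_{st} + y^i_{st} \right) & \le & 1 & \forall i \\
x^i_{st} + y^i_{st} & = & y^i_{s(t-1)} & \forall i, s \in \{g,b\}, t \ge 2 \\
x^i_{g1} + y^i_{g1} & = & \sum_{t \ge 1} x^i_{gt} u_{it} + \sum_{t \ge 1} v_{it} x^i_{bt} & \forall i \\
x^i_{b1} + y^i_{b1} & = & \sum_{t \ge 1} x^i_{gt} (1 - u_{it}) + \sum_{t \ge 1} (1 - v_{it}) x^i_{bt} & \forall i \\
x^i_{st}, y^i_{st} & \ge & 0 & \forall i, s \in \{g,b\}, t
\end{array}
\]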



Lemma 2.3. The optimal objective to Whittle's LP, OPT, is at least the value of the optimal policy. 

Proof. Consider the optimal policy. In the execution of this policy, for each arm $i$ and state $(s, t)$ for
$s \in \{g, b\}$, let the variable $x^i_{st}$ denote the probability (or fraction of time steps) of the event: Arm $i$ is in state
$(s, t)$ and gets played. Let $y^i_{st}$ correspond to the probability of the event that the state is $(s, t)$ and the arm is
not played. Since the underlying Markov chains are ergodic, the optimal policy when executed is ergodic,
and the above probabilities are well-defined.

Now, at any time step, some arm $i$ in state $(s, t)$ is played, which implies the $x^i_{st}$ values are probabilities
of mutually exclusive events. This implies they satisfy the first constraint in the LP. Similarly, for each arm
$i$, at any step, this arm is in some state $(s, t)$ and is either played or not played, so that the $x^i_{st}, y^i_{st}$ correspond
to mutually exclusive events. This implies that for each $i$, they satisfy the second constraint. For any arm
$i$ and state $(s, t)$, the LHS of the third constraint is the probability of being in this state, while the RHS is
the probability of entering this state; these are clearly identical in the steady state. For arm $i$, the LHS of
the fourth (resp. fifth) constraint is the probability of being in state $(g, 1)$ (resp. $(b, 1)$), and the RHS is the
probability of entering this state; again, these are identical.

This shows that the probability values defined for the execution of the optimal policy are feasible for the
constraints of the LP. The value of the optimal policy is precisely $\sum_{i=1}^{n} \sum_{t \ge 1} r_i x^i_{gt}$, which is at most OPT,
the maximum possible objective for the LP. □

The above LP encodes in one variable $x^i_{st}$ the probability that the arm $i$ is in state $(s, t)$ and gets played;
however, we note that in the optimal policy, this decision to play actually depends on the joint state of all 
arms. This separation of the joint probabilities into individual probabilities effectively relaxes the condition 
of having one play per step, to allowing multiple plays per step but requiring one play per step on average. 
While the policy generated by Whittle's LP is infeasible, the relaxation allows us to compute an upper-bound 
on the value of the optimal feasible policy. 

We note that $y^i_{st} = \sum_{t' > t} x^i_{st'}$. It is convenient to eliminate the variables $y^i_{st}$ by substitution, after which the last two
constraints collapse into the same constraint. Thus, we have the natural LP formulation shown in Figure 1.
We note that the first constraint can either be an inequality ($\le$) or an equality; w.l.o.g., we use equality, since
we can add a dummy arm that does not yield any reward on playing.
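
The elimination can be spelled out as follows. This is a sketch rather than the paper's own derivation: it assumes that $y^i_{st} \to 0$ as $t \to \infty$, so that the balance constraints $x^i_{st} + y^i_{st} = y^i_{s(t-1)}$ described in the proof of Lemma 2.3 telescope.

```latex
\begin{align*}
y^i_{st} &= x^i_{s(t+1)} + y^i_{s(t+1)} = \sum_{t' > t} x^i_{st'}, \\
\sum_{s \in \{g,b\}} \sum_{t \ge 1} \left( x^i_{st} + y^i_{st} \right)
  &= \sum_{s \in \{g,b\}} \sum_{t \ge 1} \sum_{t' \ge t} x^i_{st'}
   = \sum_{s \in \{g,b\}} \sum_{t' \ge 1} t' \, x^i_{st'} \le 1, \\
x^i_{g1} + y^i_{g1} = \sum_{t \ge 1} x^i_{gt}
  &= \sum_{t \ge 1} x^i_{gt} u_{it} + \sum_{t \ge 1} v_{it} x^i_{bt}
  \iff \sum_{t \ge 1} x^i_{gt} (1 - u_{it}) = \sum_{t \ge 1} v_{it} x^i_{bt}, \\
x^i_{b1} + y^i_{b1} = \sum_{t \ge 1} x^i_{bt}
  &= \sum_{t \ge 1} x^i_{gt} (1 - u_{it}) + \sum_{t \ge 1} (1 - v_{it}) x^i_{bt}
  \iff \sum_{t \ge 1} v_{it} x^i_{bt} = \sum_{t \ge 1} x^i_{gt} (1 - u_{it}).
\end{align*}
```

The last two lines give the same equation, which is why only one balance constraint per arm survives the substitution.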



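\[
\begin{array}{lll}
\text{Maximize} & \sum_{i=1}^{n} \sum_{t \ge 1} r_i x^i_{gt} & \qquad \text{(Whittle)} \\
 & \sum_{i=1}^{n} \sum_{s \in \{g,b\}} \sum_{t \ge 1} x^i_{st} \le 1 & \\
 & \sum_{t \ge 1} \sum_{s \in \{g,b\}} t \, x^i_{st} \le 1 & \forall i \\
 & \sum_{t \ge 1} x^i_{bt} v_{it} = \sum_{t \ge 1} x^i_{gt} (1 - u_{it}) & \forall i \\
 & x^i_{st} \ge 0 & \forall i, s \in \{g,b\}, t
\end{array}
\]

Figure 1: The linear program (Whittle) for the feedback MAB problem.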



From now on, let OPT denote the value of the optimal solution to (Whittle). The LP in its current 
form has infinitely many constraints; we will now show that this LP can be solved in polynomial time to 
arbitrary precision by finding structure in the Lagrangean. 

2.2 Decoupling Arms via the Lagrangean 

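In (Whittle), the only constraint connecting different arms is the constraint:

\[
\sum_{i=1}^{n} \sum_{s \in \{g,b\}} \sum_{t \ge 1} x^i_{st} \le 1
\]

We absorb this constraint into the objective via Lagrange multiplier $\lambda \ge 0$ to obtain the following objective:

\[
\begin{array}{lll}
\text{Max.} & \lambda + G(\lambda) = \lambda + \sum_{i=1}^{n} \sum_{t \ge 1} \left( r_i x^i_{gt} - \lambda (x^i_{gt} + x^i_{bt}) \right) & \qquad (\text{LPLagrange}(\lambda)) \\
 & \sum_{t \ge 1} \sum_{s \in \{g,b\}} t \, x^i_{st} \le 1 & \forall i \\
 & \sum_{t \ge 1} x^i_{bt} v_{it} = \sum_{t \ge 1} x^i_{gt} (1 - u_{it}) & \forall i \\
 & x^i_{st} \ge 0 & \forall i, s \in \{g,b\}, t
\end{array}
\]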

Through the Lagrangean, we have effectively removed the only constraint that connected multiple arms. 
LPLagrange($\lambda$) now yields $n$ disjoint maximization problems, one for each arm $i$: At any time step, arm
$i$ can be played (and reward obtained from it), or not played. Whenever the arm is played, we incur a penalty
$\lambda$ in addition to the reward. The goal is to maximize the expected reward minus cost. Note that if the penalty
is zero, the arm is played every step, and if the penalty is sufficiently large, the optimal solution would be to 
never play the arm. 

Definition 2. For each arm $i$, let $L_i(\lambda)$ denote the optimal policy, and let $H_i(\lambda)$ denote the optimal reward
minus penalty. Note that the global reward minus penalty is the sum over the arms: $G(\lambda) = \sum_{i=1}^{n} H_i(\lambda)$.

2.2.1 Characterizing the Optimal Single-arm Policy 

We first show that the optimal policy $L_i(\lambda)$ for any arm $i$ belongs to the class of policies $\mathcal{P}_i(t)$ for $t \ge 1$,
whose specification is presented in Figure 2. Intuitively, step (1) corresponds to exploitation, and step (2) to
exploration. Set $\mathcal{P}_i(\infty)$ to be the policy that never plays the arm.

Policy $\mathcal{P}_i(t)$:

1. If the arm was just observed to be in state $g$, then play the arm.

2. If the arm was just observed to be in state $b$, wait $t - 1$ steps and play the arm.

Figure 2: The Policy $\mathcal{P}_i(t)$.

To show this, we take the dual of LPLagrange($\lambda$):

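$$\text{Minimize} \quad \lambda + \sum_{i=1}^{n} h_i \qquad (\text{Whittle-Dual}(\lambda))$$

$$\lambda + t h_i \ge r_i - (1 - u_{it}) p_i \quad \forall i,\ t \ge 1$$

$$\lambda + t h_i \ge v_{it} p_i \quad \forall i,\ t \ge 1$$

$$h_i \ge 0 \quad \forall i$$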

The fact that the optimal single-arm policy $L_i(\lambda)$ belongs to the class $\{\mathcal{P}_i(t)\}$ comes from part (5) of the
following lemma.

Lemma 2.4. For any $\lambda \ge 0$, in the optimal solution to Whittle-Dual($\lambda$), for any arm $i$ with $h_i > 0$:

1. $h_i = H_i(\lambda)$.

2. $p_i \ge 0$.

3. For some $t_i \ge 1$, $x^i_{b t_i} > 0$ and $\lambda + t_i h_i = v_{i t_i} p_i$.

4. $x^i_{g1} > 0$ and $\lambda + h_i = r_i - \beta_i p_i$.

5. The optimal single-arm policy for arm $i$ is $L_i(\lambda) = \mathcal{P}_i(t_i)$.

Proof. The first part follows from strong duality. The problem LPLagrange($\lambda$), ignoring the
constant $\lambda$ in the objective, separates into $n$ separate LPs, one for each arm. The dual objective for arm $i$ is
precisely $h_i$, which must be the same as the primal objective, $H_i(\lambda)$.

If $h_i = H_i(\lambda) > 0$, the solution to the LP for arm $i$ is the policy $L_i(\lambda)$. In order to have non-zero $H_i(\lambda)$,
such a policy must play the arm first in some state $(b, t_i)$ and some state $(g, t'_i)$. Since $x^i_{st}$ is the probability this
policy plays in state $(s, t)$, this implies $x^i_{b t_i} > 0$ and $x^i_{g t'_i} > 0$.

Since $x^i_{b t_i} > 0$, by complementary slackness, we have $\lambda + t_i h_i = v_{i t_i} p_i$. Since the LHS is at least zero,
this implies $p_i \ge 0$. This proves parts (2) and (3).

To see part (4), observe that for the set of constraints $\lambda + t h_i \ge r_i - (1 - u_{it}) p_i$, since $1 - u_{it}$ is
a monotonically increasing function of $t$, the RHS is monotonically decreasing in $t$. Since the LHS is
monotonically increasing, if the LHS and RHS are equal, they have to be so for $t = 1$. Now, since $x^i_{g t'_i} > 0$,
by complementary slackness, $\lambda + t'_i h_i = r_i - (1 - u_{i t'_i}) p_i$. By the above argument, $t'_i = 1$, and since
$1 - u_{i1} = \beta_i$, this completes the proof of part (4).

Since $x^i_{g1} > 0$ and $x^i_{b t_i} > 0$, the optimal policy $L_i(\lambda)$ plays the arm in state $(g, 1)$ and in state $(b, t_i)$,
which is precisely the description of $\mathcal{P}_i(t_i)$. This proves part (5). □

It will be instructive to interpret the problem $L_i(\lambda)$ as follows: Amortize the reward so that for each
play, the arm $i$ yields a steady reward of $\lambda$. The goal is to find the single-arm policy that optimizes the
excess reward per step over and above the amortized reward $\lambda$ per play. As we have shown above, the
optimal value for this problem is precisely $H_i(\lambda)$, and the policy $L_i(\lambda)$ that achieves this belongs to the
class $\{\mathcal{P}_i(t), t \ge 1\}$.

2.2.2 Solving LPLagrange(A) 

Having decomposed the program LPLagrange($\lambda$) into independent maximization problems for each arm,
and having characterized the optimal single-arm policies, we can now solve the program in polynomial time. 
It will turn out this can be solved by simple function maximization via closed form expressions. 

Definition 3. For policy $\mathcal{P}_i(t)$, let $R_i(t)$ denote the expected per-step reward, and let $Q_i(t)$ denote the
expected rate of play. Let $F_i(\lambda, t) = R_i(t) - \lambda Q_i(t)$ denote the value of $\mathcal{P}_i(t)$. Also define:

$$t_i(\lambda) = \arg\max_{t \ge 1} F_i(\lambda, t) = \arg\max_{t \ge 1} \left( R_i(t) - \lambda Q_i(t) \right) \qquad (1)$$

Finally let $R_i(\lambda) = R_i(t_i(\lambda))$ and $Q_i(\lambda) = Q_i(t_i(\lambda))$.

Note that the optimal reward minus cost for arm $i$ is simply $H_i(\lambda) = \max_{t \ge 1} \left( R_i(t) - \lambda Q_i(t) \right) = R_i(t_i) - \lambda Q_i(t_i)$.
Since each $\mathcal{P}_i(t)$ corresponds to a Markov Chain, it is straightforward to obtain closed form
expressions for $R_i(t)$ and $Q_i(t)$.

Lemma 2.5. In playing an arm with reward $r$ and transition probabilities $\alpha$ and $\beta$, the policy $\mathcal{P}(t)$ yields
average reward $R(t) = r \frac{v_t}{v_t + t\beta}$ and expected rate of play $Q(t) = \frac{v_t + \beta}{v_t + t\beta}$. Recall that
$v_t = \frac{\alpha}{\alpha + \beta} \left( 1 - (1 - \alpha - \beta)^t \right)$ is the probability the arm is good given it was observed to be bad $t$ steps ago.

Proof. The Markov chain describing the policy $\mathcal{P}(t)$ is shown in Figure 3 and has $t + 1$ states which we
denote $s, 0, 1, 2, \ldots, t - 1$. The state $s$ corresponds to the arm being observed to be in state $g$, and the state
$j$ corresponds to the arm being observed in state $b$ exactly $j$ steps ago. The transition probability from state
$j$ to state $j + 1$ is 1, from state $s$ to state 0 is $\beta$, from state $t - 1$ to state $s$ is $v_t$, and from state $t - 1$ to state
0 is $1 - v_t$. Let $\pi_s, \pi_0, \pi_1, \ldots, \pi_{t-1}$ denote the steady state probabilities of being in states $s, 0, 1, \ldots, t - 1$
respectively. This Markov chain is easy to solve. We have $\pi_0 = \pi_1 = \cdots = \pi_{t-1}$, so that the first identity is:
$\pi_s + t\pi_0 = 1$. Furthermore, by considering transitions into and out of $s$, we obtain: $\beta\pi_s = v_t\pi_{t-1} = v_t\pi_0$.
Combining these, we obtain: $\pi_s = \frac{v_t}{v_t + t\beta}$ and $\pi_0 = \frac{\beta}{v_t + t\beta}$. We have:

$$R(t) = r\left[ (1 - \beta)\pi_s + v_t\pi_0 \right] = r\pi_s = r \frac{v_t}{v_t + t\beta}$$

$$Q(t) = \pi_s + \pi_{t-1} = \frac{v_t + \beta}{v_t + t\beta}$$

□

Figure 3: Markov Chain for policy $\mathcal{P}(t)$.
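The closed forms of Lemma 2.5 can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes $\alpha + \beta < 1$, and the helper names `v`, `closed_form` and `stationary` are hypothetical. It builds the chain on states $s, 0, \ldots, t-1$ described in the proof, approximates its stationary distribution by power iteration, and compares the result with $R(t)$ and $Q(t)$.

```python
def v(t, alpha, beta):
    """Probability the arm is good t steps after it was observed bad."""
    return alpha / (alpha + beta) * (1.0 - (1.0 - alpha - beta) ** t)


def closed_form(t, r, alpha, beta):
    """Return (R(t), Q(t)) using the expressions of Lemma 2.5."""
    vt = v(t, alpha, beta)
    denom = vt + t * beta
    return r * vt / denom, (vt + beta) / denom


def stationary(t, alpha, beta, iters=5000):
    """Approximate the stationary distribution by power iteration.

    Index 0 is state s; index j + 1 is the state 'observed bad j steps ago'.
    The iteration count is an assumption chosen to be ample for small chains.
    """
    vt = v(t, alpha, beta)
    pi = [1.0] + [0.0] * t
    for _ in range(iters):
        nxt = [0.0] * (t + 1)
        nxt[0] += pi[0] * (1.0 - beta)   # s -> s (arm stays good)
        nxt[1] += pi[0] * beta           # s -> 0 (arm observed bad)
        for j in range(1, t):            # waiting states advance by one
            nxt[j + 1] += pi[j]
        nxt[0] += pi[t] * vt             # t-1 -> s (play, observe good)
        nxt[1] += pi[t] * (1.0 - vt)     # t-1 -> 0 (play, observe bad)
        pi = nxt
    return pi


if __name__ == "__main__":
    r, alpha, beta, t = 1.0, 0.3, 0.2, 4
    pi = stationary(t, alpha, beta)
    print(closed_form(t, r, alpha, beta))
    print(r * ((1 - beta) * pi[0] + v(t, alpha, beta) * pi[t]), pi[0] + pi[t])
```

The two printed pairs should agree up to floating-point error, since the reward rate is $r[(1-\beta)\pi_s + v_t\pi_{t-1}]$ and the play rate is $\pi_s + \pi_{t-1}$.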

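Lemma 2.6. (Proved in the Appendix.) For each arm $i$, the optimal reward minus penalty of the single-arm
policy for arm $i$ is

$$H_i(\lambda) = \max_{t \ge 1} F_i(\lambda, t) = \max_{t \ge 1} \frac{(r_i - \lambda) v_{it} - \lambda\beta_i}{v_{it} + t\beta_i}$$

The maximizer $t_i(\lambda) = \arg\max_{t \ge 1} F_i(\lambda, t)$ satisfies the following:

1. If $\lambda \ge r_i \frac{\alpha_i}{\alpha_i + \beta_i(\alpha_i + \beta_i)}$, then $t_i(\lambda) = \infty$, and $H_i(\lambda) = 0$.

2. If $\lambda = r_i \frac{\alpha_i}{\alpha_i + \beta_i(\alpha_i + \beta_i)} - \rho$ for some $\rho > 0$, then $t_i(\lambda)$ (and hence $H_i(\lambda)$) can be computed in time
polynomial in the input size and in $\log(1/\rho)$ by binary search.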
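To make the single-arm optimization concrete, here is a small sketch that evaluates $F_i(\lambda, t)$ and maximizes it over $t$. It is an illustration under stated assumptions rather than the lemma's algorithm: it replaces the binary search by a brute-force scan up to a hypothetical cap `t_max` (which can miss the maximizer when $\lambda$ is extremely close to the threshold), it assumes $\alpha + \beta < 1$, and all function names are made up for this example.

```python
import math


def v(t, alpha, beta):
    """Probability the arm is good t steps after it was observed bad."""
    return alpha / (alpha + beta) * (1.0 - (1.0 - alpha - beta) ** t)


def F(lam, t, r, alpha, beta):
    """Excess reward R(t) - lam * Q(t) of the policy that waits t steps."""
    vt = v(t, alpha, beta)
    return ((r - lam) * vt - lam * beta) / (vt + t * beta)


def threshold(r, alpha, beta):
    """Penalty at or above which never playing is optimal (Lemma 2.6, part 1)."""
    return r * alpha / (alpha + beta * (alpha + beta))


def best_wait(lam, r, alpha, beta, t_max=10_000):
    """Return (t_i, H_i); t_i is math.inf when the arm should never be played.

    Brute-force scan over t = 1..t_max, a stand-in for the binary search.
    """
    if lam >= threshold(r, alpha, beta):
        return math.inf, 0.0
    t_best = max(range(1, t_max + 1), key=lambda t: F(lam, t, r, alpha, beta))
    return t_best, F(lam, t_best, r, alpha, beta)


if __name__ == "__main__":
    print(best_wait(0.1, 1.0, 0.3, 0.2))
    print(best_wait(0.75, 1.0, 0.3, 0.2))
```

For $r = 1$, $\alpha = 0.3$, $\beta = 0.2$ the threshold is $0.3 / (0.3 + 0.2 \cdot 0.5) = 0.75$, so the second call falls under part (1) of the lemma.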



2.3 The BalancedIndex Policy 

Though we could now use LPLagrange($\lambda$) to solve Whittle's LP by finding the $\lambda$ so that $\sum_{i=1}^{n} Q_i(\lambda) \approx 1$
(refer Appendix A.3 for details), our 2-approximation policy will not be based on this approach. For our
analysis to work, we must make a subtle but crucial modification: We will instead set $\lambda$ to be the sum of
the excess reward for all single-arm policies, $\sum_{i=1}^{n} H_i(\lambda)$. (Recall that we can interpret $\lambda$ to be a penalty
per play, so in the optimal single-arm policy for arm $i$, $H_i(\lambda)$ is the average reward minus penalty.) Note
that by Lemma 2.7 this implies $\lambda \ge \mathrm{OPT}/2$ and $\sum_{i=1}^{n} H_i(\lambda) \ge \mathrm{OPT}/2$. Intuitively, we are forcing the
Lagrangean to balance short-term reward (represented by $\lambda$) with long-term average reward (represented
by $H_i(\lambda)$). Our balancing technique can be generalized to many other restless bandit problems (see
Sections 4-8).

We first show how to compute this value of $\lambda$ in polynomial time. We begin by presenting the connection
between $G(\lambda) = \sum_{i=1}^{n} H_i(\lambda)$ and OPT, the value of the optimal solution to (Whittle).

Lemma 2.7. For any $\lambda \ge 0$, we have: $\lambda + G(\lambda) = \lambda + \sum_{i=1}^{n} H_i(\lambda) \ge \mathrm{OPT}$.

Proof. By Lemma 2.4 part (1), we have: $\lambda + \sum_{i=1}^{n} H_i(\lambda) = \lambda + \sum_{i=1}^{n} h_i$. The latter is the objective of
the dual of (Whittle), which implies the lemma by weak duality. □

Lemma 2.8. $h_i = H_i(\lambda)$ is a non-increasing function of $\lambda$.

Proof. Recall from Lemma 2.4 that $h_i = H_i(\lambda)$. For any $\lambda$, consider the value $F_i(\lambda, t) = R_i(t) - \lambda Q_i(t)$
of the policy $\mathcal{P}_i(t)$. Since this decreases as $\lambda$ increases, $H_i(\lambda) = \max_t F_i(\lambda, t)$ is also a non-increasing
function of $\lambda$. □



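Lemma 2.9. In polynomial time, we can find a $\lambda$ so that $\lambda \ge (1 - \epsilon)\mathrm{OPT}/2$, and $G(\lambda) = \sum_{i=1}^{n} H_i(\lambda) \ge \mathrm{OPT}/2$.

Proof. First note by Lemma 2.8 that $G(\lambda) = \sum_{i=1}^{n} H_i(\lambda)$ is monotonically non-increasing in $\lambda$. Therefore,
start with $\lambda = \sum_{i=1}^{n} r_i$, and scale $\lambda$ down by a factor of $(1 + \epsilon)$ until $\lambda \le G(\lambda)$. Note that for any $\lambda$,
the value of $G(\lambda)$ can be computed in poly-time by Lemma 2.6. At this point, let $\lambda' = \lambda(1 + \epsilon)$. Since
$G(\lambda') < \lambda'$, by Lemma 2.7 we have $\lambda' \ge \mathrm{OPT}/2$, which implies $\lambda \ge (1 - \epsilon)\mathrm{OPT}/2$. Further, since
$\lambda \le G(\lambda)$, again by Lemma 2.7 we have $G(\lambda) \ge \mathrm{OPT}/2$. □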
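The geometric search in the proof of Lemma 2.9 can be sketched as follows. This is an illustrative sketch under assumptions, not the paper's implementation: each arm is a hypothetical tuple `(r, alpha, beta)` with $r > 0$, $\alpha > 0$ and $\alpha + \beta < 1$, and $H_i(\lambda)$ is approximated by scanning $t$ up to an assumed cap `t_max` instead of using the binary search of Lemma 2.6.

```python
def v(t, alpha, beta):
    """Probability the arm is good t steps after it was observed bad."""
    return alpha / (alpha + beta) * (1.0 - (1.0 - alpha - beta) ** t)


def H(lam, r, alpha, beta, t_max=2000):
    """Approximate H_i(lam): best excess reward, floored at 0 (never play)."""
    best = 0.0
    for t in range(1, t_max + 1):
        vt = v(t, alpha, beta)
        best = max(best, ((r - lam) * vt - lam * beta) / (vt + t * beta))
    return best


def G(lam, arms):
    """Total excess reward over all arms at penalty lam."""
    return sum(H(lam, r, alpha, beta) for r, alpha, beta in arms)


def balanced_lambda(arms, eps=0.01):
    """Scale lam down by (1 + eps) until lam <= G(lam), as in Lemma 2.9."""
    lam = sum(r for r, _, _ in arms)   # here every H_i is 0, so G(lam) < lam
    while lam > G(lam, arms):
        lam /= 1.0 + eps
    return lam


if __name__ == "__main__":
    arms = [(1.0, 0.3, 0.2), (0.8, 0.1, 0.1)]
    lam = balanced_lambda(arms, eps=0.05)
    print(lam, G(lam, arms))
```

The loop terminates because $G$ is non-increasing in $\lambda$ and strictly positive as $\lambda \to 0$ when rewards are positive; the returned $\lambda$ and its predecessor $\lambda(1+\epsilon)$ straddle the balance point $\lambda = G(\lambda)$.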



2.3.1 Index Policy 



We start with the value of $\lambda$ from Lemma 2.9. The policy only works with the subset of arms $S$ so that for
$i \in S$, we have $H_i(\lambda) > 0$. For this $\lambda$, the solution to LPLagrange($\lambda$) yields one policy $\mathcal{P}_i(t_i(\lambda))$ of
value $H_i(\lambda)$ for each arm $i \in S$ (see Lemma 2.4). Let $t_i = t_i(\lambda)$. Recall that if an arm was last observed
in state $s \in \{g, b\}$ some $t \ge 1$ steps ago, then its state is denoted $(s, t)$. We call an arm $i$ in state $(g, 1)$
good; in state $(b, t)$ for $t \ge t_i$ ready; and in state $(b, t)$ for $t < t_i$ bad. The policy is shown in Figure 4.



BalancedIndex Policy for Feedback MAB

Consider only arms with $H_i(\lambda) > 0$; denote these as set $S$.
Play arms in $S$ according to the following priority scheme:

1. Exploit: Play any arm in state $(g, 1)$ (good state).

2. If condition (1) does not hold:

(a) Explore: Play any arm $i \in S$ in state $(b, t)$ such that $t \ge t_i$ (ready state).

(b) Idle: If no arm is good or ready (all arms bad), do not play at this step.

Figure 4: The BalancedIndex Policy for Feedback MAB.
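The priority scheme of Figure 4 can be sketched as a small selection routine. The encoding below is an assumption made for illustration: an arm's state is a tuple `(s, t)` with `s` in `{'g', 'b'}`, good states get priority 2, ready states priority 1, everything else priority -1, and idling corresponds to a dummy option with priority 0. The names `priority` and `choose_arm` are hypothetical.

```python
def priority(state, t_ready, in_S):
    """Priority of one arm under the Figure 4 scheme.

    state   -- (s, t): last observed state s, observed t steps ago
    t_ready -- the waiting time t_i of the arm's single-arm policy
    in_S    -- whether the arm has positive excess reward
    """
    s, t = state
    if not in_S:
        return -1
    if s == "g" and t == 1:
        return 2          # good: exploit
    if s == "b" and t >= t_ready:
        return 1          # ready: explore
    return -1             # bad: never played


def choose_arm(states, t_ready, in_S):
    """Return the index of the arm to play, or None to idle."""
    best_arm, best_priority = None, 0   # priority 0 is the idle option
    for i, state in enumerate(states):
        p = priority(state, t_ready[i], in_S[i])
        if p > best_priority:
            best_arm, best_priority = i, p
    return best_arm


if __name__ == "__main__":
    print(choose_arm([("b", 3), ("g", 1), ("b", 1)], [2, 1, 5], [True] * 3))
    print(choose_arm([("b", 3), ("b", 1)], [2, 5], [True, True]))
    print(choose_arm([("b", 1)], [5], [True]))
```

Ties among ready arms are broken by position in this sketch; the policy as stated allows any ready arm to be chosen.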

Note that the way the scheme works, at most one arm can be in state $(g, 1)$ at any time step, and if such
an arm exists, this arm is played at the current step (and in the future until it switches out of this state).
The above can be thought of as executing the policies $\mathcal{P}_i(t_i)$ for arms $i \in S$ independently and in case of
simultaneous attempts to play, resolving conflicts according to the above priority scheme.

Though the above policy is not written as an index policy, it is equivalent to the following index: There
is a dummy arm with index 0 that does not yield reward on playing. If $h_i = H_i(\lambda) = 0$, the index for all
states of this arm is $-1$. For arms with $h_i > 0$, the index for state $(g, 1)$ is 2; that for states $(b, t)$ with $t \ge t_i$
is 1, and that for states $(b, t)$ with $t < t_i$ is $-1$.

2.3.2 Analysis

We now prove that the BalancedIndex policy is in fact a 2-approximation. The proof is based on the fact
that the Lagrange multiplier $\lambda$ and the excess rewards $h_i = H_i(\lambda)$ give us a way of accounting for the average reward.
By Lemma 2.7, $\lambda \ge \mathrm{OPT}/2$ and $\sum_{i} h_i \ge \mathrm{OPT}/2$, which gives us a way of linking the rewards from
our policy to the LP optimum.

Theorem 2.10. The BalancedIndex policy is a $(2 + \epsilon)$-approximation to Feedback MAB. Furthermore,
this policy can be computed in polynomial time.

Proof. Recall that the reward of the optimal single-arm policy $\mathcal{P}_i(t_i)$ is $R_i(\lambda) = H_i(\lambda) + \lambda Q_i(\lambda)$, so that this
reward can be accounted as $H_i(\lambda) = h_i$ per step plus $\lambda$ per play. We use this amortization of rewards to
show that the average reward of our index policy is at least $(1 - \epsilon)\mathrm{OPT}/2$.

Focus on any arm $i$. We call a step blocked for the arm if the arm is ready for play (the state is $(b, t)$
where $t \ge t_i$) but some other arm is played at the current step. Consider only the time steps which are not
blocked for arm $i$. For these time steps, the arm behaves as follows: It is continuously played in state $(g, 1)$.
Then it transitions to state $(b, 1)$ and moves in $t_i - 1$ time steps to state $(b, t_i - 1)$. After this the arm might
be blocked, and the next state that is not blocked is $(b, t)$ for some $t \ge t_i$, at which point the arm is played.

Using the formula for $R(t)$ from Lemma 2.5 and since $v_{it} \ge v_{i t_i}$ for $t \ge t_i$, we have

$$R_i(t) \ge r_i \frac{v_{it}}{v_{it} + t_i\beta_i} \ge r_i \frac{v_{i t_i}}{v_{i t_i} + t_i\beta_i} = R_i(t_i)$$

which implies that the per-step reward of this single-arm policy for arm $i$ restricted to the non-blocked
time steps is at least the per-step reward $R_i(t_i)$ of the optimal single-arm policy $\mathcal{P}_i(t_i)$. Therefore, for these
non-blocked steps, the reward we get is at least $h_i = H_i(\lambda)$ per step, and at least $\lambda$ per play.

Now, on steps where no arm is played, none of the arms is blocked by definition, so our amortization
yields a per-step reward of at least $\sum_{i \in S} h_i \ge \mathrm{OPT}/2$. On steps when some arm is played, the arm that
is played by definition cannot be blocked, so we get a reward of at least $\lambda \ge (1 - \epsilon)\mathrm{OPT}/2$ for this step.
This completes the proof. □

2.3.3 Alternate Analysis 

The above analysis is very intuitive. We now present an alternative way to analyze the policy that leads to a
more generalizable technique. This uses a Lyapunov (potential) function argument. Recall from Lemma 2.4
that $h_i = H_i(\lambda)$; further that $t_i = t_i(\lambda)$. Define the potential $\Phi_i$ for each arm $i$ at any time as follows:

Definition 4. If arm $i$ moved to state $b$ some $y$ steps ago ($y \ge 1$), the potential $\Phi_i$ is $h_i(\min(y, t_i) - 1)$. In
the state $g$, the potential is $p_i$. Recall that $p_i$ is the optimal dual variable in Whittle-Dual($\lambda$).

Let $\Phi_T = \sum_{i=1}^{n} \Phi_i$ denote the total potential at step $T$, and let $R_T$ denote the total reward accrued until
that step. Define the function $\mathcal{L}_T = T \cdot (1 - \epsilon)\mathrm{OPT}/2 - R_T - \Phi_T$. Let $\Delta R_T = R_{T+1} - R_T$ and $\Delta\Phi_T = \Phi_{T+1} - \Phi_T$.

Lemma 2.11. $\mathcal{L}_T$ is a Lyapunov function, i.e., $\mathbf{E}[\mathcal{L}_{T+1} \mid \mathcal{L}_T] \le \mathcal{L}_T$. Equivalently, at any step:

$$\mathbf{E}[\Delta R_T + \Delta\Phi_T \mid R_T, \Phi_T] \ge (1 - \epsilon)\mathrm{OPT}/2$$

Proof. At a given step, suppose the policy does nothing; then all arms are "not ready". The total increase in
potential is precisely

$$\Delta\Phi_T = \sum_{i \in S} h_i = G(\lambda) \ge \mathrm{OPT}/2$$

On the other hand, suppose that the policy plays arm $i$, which has last been observed in state $b$ and has
been in that state for $y \ge t_i$ steps. With probability $q \ge v_{i t_i}$ the observed state is $g$, the change in reward is
$\Delta R_T = r_i$, and the change in potential is $p_i - h_i(t_i - 1)$. With probability $1 - q$ the observed state is $b$ and
the change in potential is $-h_i(t_i - 1)$ (and there is no change in reward). Thus in this case, since $q \ge v_{i t_i}$
and $p_i \ge 0$, we have:

$$\mathbf{E}[\Delta R_T + \Delta\Phi_T \mid R_T, \Phi_T] \ge q p_i - h_i(t_i - 1) \ge v_{i t_i} p_i - h_i(t_i - 1) \ge \lambda + h_i \ge (1 - \epsilon)\mathrm{OPT}/2$$

The penultimate inequality follows from Lemma 2.4 part (3). Note that the potentials of arms not played
cannot decrease, so that the first inequality is valid.

Finally, suppose the policy plays an arm $i$ which was last observed in state $g$ and played in the last
step. With probability $1 - \beta_i$ the increase in reward is $r_i$ and the potential is unchanged. With probability $\beta_i$
the potential decreases by $p_i$. Therefore in this case, by Lemma 2.4 part (4):

$$\mathbf{E}[\Delta R_T + \Delta\Phi_T \mid R_T, \Phi_T] \ge r_i - \beta_i p_i \ge \lambda + h_i \ge (1 - \epsilon)\mathrm{OPT}/2$$

□

By their definition, the potentials are bounded independent of the time horizon. By telescoping
summation, the above lemma implies that $\lim_{T \to \infty} \mathbf{E}[R_T]/T \ge (1 - \epsilon)\mathrm{OPT}/2$. This proves Theorem 2.10.

Gap of Whittle's LP. The following theorem shows that our analysis is almost tight (considering that our 
2-approximation is against Whittle's LP). 

Theorem 2.12. (Proved in the Appendix.) The gap of Whittle's LP is arbitrarily close to $e/(e - 1) \approx 1.58$.



3 Analyzing the Whittle Index for Feedback MAB 

Before generalizing our 2-approximation algorithm to a larger subclass of restless bandit problems, we 
explore the connection between our analysis and the well-known Whittle Index used in practice. This 
section can be skipped without losing continuity of the paper. 

A well-studied index policy for restless bandit problems is the Whittle Index [52]. In the context of
Feedback MAB, this index has been independently studied by Le Ny et al. [40] and subsequently by
Liu and Zhao [37]. Both these works give closed form expressions for this index and show near-optimal
empirical performance. Our main result in this section is to justify the empirical performance by showing
that a simple but very natural modification of this index in order to favor myopic exploitation yields a
2-approximation. The modification simply involves giving additional priority to arms in state $(g, 1)$ if their
myopic expected next step reward $r_i(1 - \beta_i)$ is at least a threshold value.



3.1 Description of the Whittle Index 

Defined in general, the Whittle index for each state $x$ is the largest penalty-per-play $\lambda$ such that the optimal
policy is indifferent between playing in $x$ and not playing. In our specific problem, the current state for each
arm $i$ is captured by the tuple $(s, t)$: the arm was last seen to be $s \in \{g, b\}$ (good or bad) $t$ steps ago.
The Whittle index $\Pi_i(s, t)$ is a non-negative real number computed as follows: using the notation from
Section 2.2, for any penalty per play $\lambda$, there is a single-arm policy $L_i(\lambda)$ that maximizes the average
reward minus penalty (excess reward) $H_i(\lambda)$ over the infinite horizon. When $\lambda = \infty$, the optimal policy
never plays; when $\lambda = 0$, the optimal policy would play in any state. As $\lambda$ is decreased from $\infty$, at
some value $\lambda^*$, the decision in state $(s, t)$ changes from "not play" to "play". The Whittle index $\Pi_i(s, t)$ is
precisely this value of $\lambda^*$. The Whittle index policy always plays the arm with the highest Whittle index
(Fig. 5).

Whittle Index Policy: Play the arm $i$ whose current state $(s, t)$ has the highest index $\Pi_i(s, t)$.

Figure 5: The Whittle Index Policy.

Remarks. The Whittle index is strongly decomposable, i.e., can be computed separately for each arm.
Further, we have defined $\lambda$ as a penalty (or amortized reward) per play, while Whittle defines it as a reward
for not playing (which he terms the subsidy for passivity); it is easy to see that both these formulations are
equivalent. Finally, for FEEDBACK MAB, it can be shown [40, 37] that for any state $(s, t)$, there is a unique
$\lambda \in (-\infty, \infty)$ where the decision switches between "play" and "not play", i.e., the decision is monotone in
$\lambda$. Strictly speaking, the Whittle index is defined only for such systems (termed indexable by Whittle [52]);
we will define this aspect away by insisting that the index $\lambda^*$ is the largest value where a switch happens.

We present an explicit connection of Whittle's index to LPLagrange($\lambda$).

Lemma 3.1. (Proved in the Appendix.) Recall the notation $L_i(\lambda)$ and $\mathcal{P}_i(t)$ from Section 2.2. The following
hold for $\Pi_i(s, t)$:

1. $\Pi_i(s, t) \ge 0$ for all states $(s, t)$ where $s \in \{g, b\}$ and $t \in \mathbb{Z}^+$.

2. $\Pi_i(g, 1) = r_i(1 - \beta_i)$, and $\Pi_i(b, t) \le \Pi_i(g, 1)$ for all $t \ge 1$.

3. $\Pi_i(b, t) = \max\{\lambda \mid L_i(\lambda) = \mathcal{P}_i(t)\}$, and is a monotonically non-decreasing function of $t$.

Though Whittle's index is widely used, it is not clear how to analyze it since it leads to complicated 
priorities between arms. We now show that our balancing technique also implies an analysis for a slight but 
non-trivial modification to Whittle's index. 

3.2 The Threshold-Whittle Policy 

We now show that modifying the index slightly to exploit the myopic next step reward in good states $g$ yields
a 2-approximation. Note that the myopic next step reward of an arm $i$ in state $g$ is precisely $\Pi_i(g, 1) =
r_i(1 - \beta_i)$. The modification essentially favors exploiting such a "good" state if the myopic reward is at
least a certain threshold value. In particular, we analyze the policy Threshold-Whittle($\lambda$) shown in
Figure 6, where we set $\lambda = \lambda^*$, the value satisfying $\lambda^* = \sum_{i=1}^{n} H_i(\lambda^*)$ (refer Section 2.3).



Threshold-Whittle($\lambda$)
At any time step:

If $\exists$ arm $i$ in state $(g, 1)$ whose Whittle index is $\Pi_i(g, 1) = r_i(1 - \beta_i) \ge \lambda$
then Play arm $i$.

else Play the arm with the highest Whittle index.

Figure 6: Policy Threshold-Whittle($\lambda$). It exploits arm $i$ if the myopic reward in state $(g, 1)$ is $\ge \lambda$.

Note that the above policy can be restated as playing the arm with the highest modified index, which is
computed as follows: For arm $i$, if $\Pi_i(g, 1) = r_i(1 - \beta_i) \ge \lambda$, the modified index for state $(g, 1)$ is $\infty$; else
the modified index is the same as the Whittle index.
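The modified index can be sketched in a few lines. This is an illustration under assumptions: the Whittle index of each arm is passed in as a caller-supplied function (the closed forms from the cited works are not reproduced in this section), states are encoded as tuples `(s, t)`, and the names `modified_index` and `threshold_whittle_choice` are hypothetical.

```python
import math


def modified_index(state, whittle, lam, r, beta):
    """Modified index of one arm for Threshold-Whittle(lam).

    whittle -- assumed callable mapping a state (s, t) to the arm's Whittle index
    r, beta -- the arm's reward and good-to-bad transition probability
    """
    s, t = state
    if s == "g" and t == 1 and r * (1.0 - beta) >= lam:
        return math.inf   # myopic reward clears the threshold: always exploit
    return whittle(state)


def threshold_whittle_choice(states, whittle_fns, lam, rewards, betas):
    """Return the index of the arm with the highest modified index."""
    return max(
        range(len(states)),
        key=lambda i: modified_index(
            states[i], whittle_fns[i], lam, rewards[i], betas[i]
        ),
    )


if __name__ == "__main__":
    # Toy, made-up index functions used only to exercise the selection rule.
    fns = [lambda st: 0.5, lambda st: 0.7]
    states = [("g", 1), ("b", 4)]
    print(threshold_whittle_choice(states, fns, 0.4, [1.0, 1.0], [0.5, 0.2]))
    print(threshold_whittle_choice(states, fns, 0.6, [1.0, 1.0], [0.5, 0.2]))
```

In the first call the good arm's myopic reward $1.0 \cdot (1 - 0.5) = 0.5$ clears the threshold $0.4$, so it is played regardless of the other arm's index; in the second call the threshold $0.6$ is not met and the plain Whittle indices decide.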

Theorem 3.2. Threshold-Whittle($\lambda^*$) is a 2-approximation for FEEDBACK MAB. Here, $\lambda^*$ satisfies
$\lambda^* = \sum_{i=1}^{n} H_i(\lambda^*)$ (refer Section 2.3).

3.3 Proof of Theorem 3.2

We now prove the above result by modifying our analysis of the BalancedIndex policy (from Figure 4).
Recall that $S$ is the set of arms with $h_i > 0$ in the optimal solution to Whittle-Dual($\lambda^*$). For such arms,
$t = t_i$ is the first time instant when $\lambda^* + t h_i \ge p_i v_{it}$ is tight. For arm $i \in S$, state $(s, t)$ is good if $s = g$ and
$t = 1$; ready if $s = b$ and $t \ge t_i$; and bad otherwise. The index policy from Figure 4 favors good over ready
states, and does not play any arm in bad states.

Claim 3.3. For any arm $i$, exactly one of the following is true for Whittle-Dual($\lambda^*$) and LPLagrange($\lambda^*$):

1. The constraint $\lambda^* + t h_i \ge v_{it} p_i$ is first tight at $t = t_i$. Then, $\Pi_i(b, t_i - 1) \le \lambda^*$ and $\Pi_i(b, t_i) \ge \lambda^*$.
Further, $\Pi_i(g, 1) = r_i(1 - \beta_i) \ge \lambda^*$ and $h_i > 0$.

2. The constraint $\lambda^* + t h_i \ge v_{it} p_i$ is not tight for any $t$. Then, $\Pi_i(b, t) \le \lambda^*$ for all $t \ge 1$, and $h_i = 0$.

Proof. The optimal solution to LPLagrange($\lambda^*$) finds the policy $\mathcal{P}_i(t_i)$ for every arm $i$ with $h_i > 0$.
Therefore, by Lemma 3.1 we must have $\Pi_i(b, t_i) \ge \lambda^*$, and $\Pi_i(b, t) \le \lambda^*$ for all $t < t_i$. Furthermore,
since the variable $x^i_{bt}$ in the optimal solution to LPLagrange($\lambda^*$) is first non-zero at $t = t_i$, this implies
the constraint $\lambda^* + t h_i \ge v_{it} p_i$ is first tight at $t = t_i$ by complementary slackness (Lemma 2.4). Further, if
this constraint is tight at $t = t_i$, since $v_{it}$ is monotonically increasing, the constraint is feasible for all $t > t_i$
only if $h_i > 0$. Finally, $\Pi_i(g, 1) = r_i(1 - \beta_i) \ge \Pi_i(b, t_i) \ge \lambda^*$ follows from Lemma 3.1.

Suppose now that $\lambda^* + t h_i \ge v_{it} p_i$ is not tight for any $t \ge 1$. Then, by complementary slackness, we
have $x^i_{bt} = 0$ for all $t \ge 1$, which implies $x^i_{gt} = 0$ for all $t \ge 1$. Therefore, the policy $L_i(\lambda^*)$ never plays
arm $i$. This implies $\Pi_i(b, t) \le \lambda^*$ for all $t \ge 1$. Since the excess reward of $L_i(\lambda^*)$ is zero, we have $h_i = 0$.
(This can also be shown by complementary slackness.) □

3.3.1 Types of Arms 



We next classify the arms as follows. In Claim 3.3, let the arms satisfying the first condition ($h_i > 0$) of the
Claim be denoted Type (1), and the remaining arms satisfying $h_i = 0$ be denoted Type (2). Note that type
(1) arms are precisely the set $S$ of arms in Fig. 4, so the BalancedIndex policy only plays type (1) arms.

Type (1): Arms in $S$. We consider the behavior of Threshold-Whittle($\lambda^*$) restricted to just these
arms. Since $\Pi_i(b, t)$ is monotonically increasing in $t$, by Claim 3.3 we have the following for the policy of
Fig. 4: If the arm is ready, the Whittle index is at least $\lambda^*$; if the arm is bad, the index is at most $\lambda^*$; and
finally, if the arm is good, then the modified Whittle index is infinity.

Therefore, Threshold-Whittle($\lambda^*$) confined to these arms gives priority to good over ready over
bad arms. The only difference with the policy in Fig. 4 is that instead of idling when all arms are bad, the
policy Threshold-Whittle($\lambda^*$) will play some bad arm. We now show that this is better than idling.

Claim 3.4. Threshold-Whittle($\lambda^*$) executed just over Type (1) arms yields reward at least OPT/2.

Proof. Consider the alternate analysis presented in Section 2.3.3. The index policy from Fig. 4 does not
play an arm $i$ in bad state, and achieves change in potential $\Delta\Phi_i$ of exactly $h_i$. All we need to show is that
if the arm is played instead, the expected change in potential is still at least $h_i$. The rest of the proof is the
same as that of Lemma 2.11. Suppose the arm is played after $t \ge 1$ steps. The expected change in potential
is: $\mathbf{E}[\Delta\Phi_i] = v_{it} p_i - h_i(t - 1)$. We further have by definition of $t_i$ that $\lambda + t_i h_i = v_{i t_i} p_i$. We therefore
have $p_i v_{i t_i} \ge t_i h_i$. Since $v_{it}$ is a concave function of $t$ with $v_{i0} = 0$, the above implies that for every $t \le t_i$,
we must have $p_i v_{it} \ge t h_i$. Therefore, $\mathbf{E}[\Delta\Phi_i] = v_{it} p_i - h_i(t - 1) \ge t h_i - h_i(t - 1) = h_i$. □

Type (2): Arms not in $S$. The only catch now is that Threshold-Whittle($\lambda^*$) can sometimes play a
type (2) arm whose $h_i = 0$. For such arms, we count their reward and ignore the change in potential.

Lemma 3.5. In Threshold-Whittle($\lambda^*$), if a type (2) arm $j$ preempts the play of a type (1) arm $i$, either
the reward from the former is at least $\lambda^*$ or the increase in potential $\Delta\Phi$ of the latter is at least $h_i$.

Proof. Suppose that for type (2) arm $j$, $\Pi_j(g, 1) = r_j(1 - \beta_j) \ge \lambda^*$. Denote such a state $(g, 1)$ as nice;
a nice type (2) arm has modified index $\infty$. When $j$ was last observed to be good, the arm can be played
continuously even if type (1) arms become ready. However, at every time step where this happens, the
current reward of playing this nice type (2) arm is precisely $r_j(1 - \beta_j)$, which is at least $\lambda^*$, and the type
(1) arms only get better from waiting. When the type (2) arm $j$ was last observed to be bad, preemption can
only happen if all type (1) arms are bad, since the Whittle index of a type (2) arm satisfies $\Pi_j(b, \infty) \le \lambda^*$. But in
this case the increase in potential of arm $i$ is $h_i$.

Finally, if $\Pi_j(g, 1) < \lambda^*$, then by Lemma 3.1 we have $\Pi_j(s, t) < \lambda^*$ for all $s \in \{b, g\}$ and $t \ge 1$. This
implies that such an arm in any state can only preempt type (1) arms that are bad; in that case, the potential
$\Phi$ of the latter rises by $h_i$ by idling. This completes the proof. □



3.3.2 Completing the Proof of Theorem 3.2 



To complete the analysis, there are two cases: First, if a nice type (2) arm, or a ready or good type (1) arm
is played, then the above discussion implies that the reward plus change in potential ($\Delta\Phi$) of this arm is at
least $\lambda^* \ge \mathrm{OPT}/2$. In the other case, all type (1) arms are bad, and focusing on just these arms, the
increase in potential for each arm is at least $h_i$, so that the total reward plus change in potential of the
system is at least $\sum_{i \in S} h_i \ge \mathrm{OPT}/2$. This completes the proof, and shows that Threshold-Whittle($\lambda^*$)
is a 2-approximation. We note that the above analysis extends easily to the variant where $M \ge 1$ arms are
simultaneously played per step.

4 The General Technique: Monotone Bandits 

In this section, we present a general and non-trivial sub-class of restless bandits for which a generalization of 
the above balancing technique yields a 2-approximate index policy. We term this class MONOTONE bandits, 
and this captures both the stochastic MAB, as well as the FEEDBACK MAB as special cases. 

In Monotone bandits, there are $n$ bandit arms. Each arm $i$ can be in one of $K$ states denoted $S_i =
\{\sigma^i_1, \sigma^i_2, \ldots, \sigma^i_K\}$. When the arm is not played, its state remains the same and it does not fetch reward.
Suppose the arm is in state $\sigma^i_k$ and is played next after $t \ge 1$ steps. Then, it gains reward $r^i_k \ge 0$, and
transitions to one of the states $\sigma^i_j \ne \sigma^i_k$ w.p. $q^i(k, j, t)$, and with the remaining probability stays in state $\sigma^i_k$.
(For notational convenience, we denote $\sigma^i_k$ simply as $k$; the arm it refers to will be clear from the context.)
The transition probabilities for different arms are independent. At most one arm is played per step. The goal
is to find a policy for playing the arms so that the infinite horizon time-average reward is maximized.

In addition, we have the following key properties about the transition probabilities:

Separability Property: We assume that $q^i(k, j, t)$ is of the form $f^i_k(t) q^i(k, j)$. The function $f^i_k(t) \in [0, 1]$
for positive integers $t$ can be thought of as an "escape probability" from the state $k \in S_i$. Conditioned
on the escape, the state changes to $j \in S_i$ with probability $q^i(k, j)$, thus $\sum_{j \ne k} q^i(k, j) = 1$.

Monotone Property: For every arm $i$ and state $k \in S_i$, we have: $f^i_k(t) \le f^i_k(t + 1)$ for every $t$.

The above properties are necessary in some sense: We show in Section 4.5 that when the monotone
property is relaxed, the problem becomes $n^{\epsilon}$-hard to approximate. Further, if the separability property is not
satisfied, then Whittle's LP, on which the analysis of this section is based, has $\Omega(n)$ gap.

Motivation and Special Cases. Intuitively, MONOTONE bandits model optimization scenarios in which
uncertainty increases: when an arm is just played and we observe its state, we are most certain that our
observation still holds true the next time step. However, the non-decreasing nature of $f$ implies that as time
goes on, the "escape probability" increases and the previous observation becomes less and less reliable. This
serves as a model for certain POMDPs, such as the FEEDBACK MAB.

Observe that MONOTONE bandits generalize the FEEDBACK MAB. For the states $S_i = \{g, b\}$,
set $q^i(g, b) = q^i(b, g) = 1$ and $f^i_g(t) = 1 - u_{it}$ and $f^i_b(t) = v_{it}$. Recall from Fact 2.2 that $u_{it}, v_{it}$ are
respectively the probabilities of observing the state $g$ when the state last observed $t$ steps ago was $g$ and $b$,
and that $1 - u_{it}$, $v_{it}$ are both monotonically increasing. We also note that MONOTONE bandits generalize
the stochastic MAB by setting $f^i_k(t) = 1$ for all $t$.
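The reduction from FEEDBACK MAB can be checked numerically. The sketch below is an illustration, not part of the paper: it assumes $\alpha + \beta < 1$ and uses the standard two-state Markov chain expressions $u_t = \frac{\alpha}{\alpha+\beta} + \frac{\beta}{\alpha+\beta}(1-\alpha-\beta)^t$ and $v_t = \frac{\alpha}{\alpha+\beta}(1 - (1-\alpha-\beta)^t)$ as stand-ins for Fact 2.2. The function names are hypothetical.

```python
def u(t, alpha, beta):
    """Probability the arm is good t steps after it was observed good."""
    return alpha / (alpha + beta) + beta / (alpha + beta) * (1 - alpha - beta) ** t


def v(t, alpha, beta):
    """Probability the arm is good t steps after it was observed bad."""
    return alpha / (alpha + beta) * (1 - (1 - alpha - beta) ** t)


def escape_probabilities(alpha, beta, horizon):
    """Escape probabilities f_g(t) = 1 - u_t and f_b(t) = v_t for t = 1..horizon."""
    f_g = [1 - u(t, alpha, beta) for t in range(1, horizon + 1)]
    f_b = [v(t, alpha, beta) for t in range(1, horizon + 1)]
    return f_g, f_b


def is_monotone(values, tol=1e-15):
    """True if the sequence is non-decreasing up to a small tolerance."""
    return all(x <= y + tol for x, y in zip(values, values[1:]))


if __name__ == "__main__":
    f_g, f_b = escape_probabilities(0.3, 0.2, 50)
    print(f_g[0], f_b[0], is_monotone(f_g), is_monotone(f_b))
```

With $\alpha = 0.3$ and $\beta = 0.2$, the one-step escape probabilities are $f_g(1) = \beta$ and $f_b(1) = \alpha$, and both sequences are non-decreasing, as the Monotone Property requires.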

4.1 High Level Idea 

Unlike the FEEDBACK MAB problem, in MONOTONE bandits, there is no longer a clear distinction between
"good" and "bad" states. Note however that an equivalent way of finding $\lambda$ such that $\lambda = \sum_{i=1}^{n} H_i(\lambda)$ is
to treat $\lambda$ as a variable and enforce $\lambda = \sum_{i=1}^{n} h_i$ as a constraint in the dual of Whittle's LP. By taking this
approach, the variables $p^i_k$ (now one for each state $k \in S_i$) can be interpreted as dual potentials, and the dual
constraints are in terms of the expected potential change of playing in state $k \in S_i$. Based on the sign of this
potential change, we can classify the states into "good" and "bad" via complementary slackness. Our index 
policy continuously exploits arms in "good" states, and waits until the dual constraint goes tight (i.e., the 
arm becomes "ready") before playing in "bad" states. We formalize the previous potential-based argument 
using a Lyapunov function and show a 2-approximation. We note that the LP-duality approach is entirely 
equivalent to the Lagrangean approach; however, it leads to a different interpretation of variables which is 
more generalizable. 

Technical Assumptions. For simplicity of the exposition, we assume the monotone functions in this section
are piece-wise linear with poly-size specification; see Definition 5 for a formal definition. As shown in
the previous section, these results do extend to a wider class of differentiable functions, such as those in
Feedback MAB.

We also assume that for each arm $i$, the graph, where the vertices are $k \in S_i$ and a directed edge $(j, k)$
exists if $q^i(j, k) > 0$, is strongly connected. Since we consider the infinite horizon time average reward,
assume that the policy is ergodic and can choose the start state of each arm. These assumptions do not
simplify the problem, as it remains NP-Hard (see Section 4.5).



4.2 Whittle's LP and its Dual 

As with Feedback MAB, for each arm $i$ and $k \in S_i$, we have variables $\{x^i_{kt}, t \ge 1\}$. These variables
capture the probabilities (in the execution of the optimal policy) of the event: Arm $i$ is in state $k$, was last
played $t$ steps ago, and is played at the current step. These quantities are well-defined for ergodic policies.
Whittle's LP is presented in Figure 7. Let its optimal value be denoted OPT. The LP effectively encodes
the constraints on the evolution of the state of each arm separately, connecting them only by the constraint
that at most one arm is played in expectation every step. The first constraint simply states that at any
step, at most one arm is played; the second constraint encodes that each arm can be in at most one possible
state at any time step; and the final constraint encodes that the rate of entering state $k \in S_i$ is the same as the
rate of exiting this state. This LP will clearly be a relaxation of the optimal policy; the details are the same
as the proof of Lemma 2.3.

This LP has infinite size, and we will fix that aspect in this section. In particular, we now show that the
LP has polynomial size when the $f^i_k$ are piece-wise linear with poly-size specification.

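Definition 5. Given $i$, $k \in S_i$, $f^i_k(t)$ is specified as the piece-wise linear function that passes through
break-points $(t_1 = 1, f^i_k(1)), (t_2, f^i_k(t_2)), \ldots, (t_m, f^i_k(t_m))$. Denote the set $\{t_1, t_2, \ldots, t_m\}$ as $W^i_k$. Therefore, for
two consecutive points $t_1, t_2 \in W^i_k$ with $t_1 < t_2$, the function $f^i_k$ is specified at $t_1$ and $t_2$. For $t \in (t_1, t_2)$,
we have $f^i_k(t) = \left( (t_2 - t) f^i_k(t_1) + (t - t_1) f^i_k(t_2) \right) / (t_2 - t_1)$. For $t \ge t_m$, we have $f^i_k(t) = f^i_k(t_m)$. We
assume that $W^i_k$ has poly-size specification.

$$\text{Maximize} \quad \sum_{i=1}^{n} \sum_{k \in S_i} \sum_{t \ge 1} r^i_k x^i_{kt} \qquad (\text{Whittle})$$

$$\sum_{i=1}^{n} \sum_{k \in S_i} \sum_{t \ge 1} x^i_{kt} \le 1$$

$$\sum_{k \in S_i} \sum_{t \ge 1} t \, x^i_{kt} \le 1 \quad \forall i$$

$$\sum_{j \in S_i, j \ne k} \sum_{t \ge 1} x^i_{kt} \, q^i(k, j, t) = \sum_{j \in S_i, j \ne k} \sum_{t \ge 1} x^i_{jt} \, q^i(j, k, t) \quad \forall i,\ k \in S_i$$

$$x^i_{kt} \ge 0 \quad \forall i, k, t$$

Figure 7: The linear program (Whittle).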
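Definition 5 translates directly into an evaluation routine. The following is a minimal sketch under assumptions: break-points are given as a hypothetical list of `(t, value)` pairs sorted by `t` with the first `t` equal to 1, and `make_escape` is a made-up helper name.

```python
from bisect import bisect_right


def make_escape(breakpoints):
    """Build f(t) from break-points [(t_1, f(t_1)), ..., (t_m, f(t_m))].

    Assumes the list is sorted by t, has distinct t values, and starts at t_1 = 1.
    """
    ts = [t for t, _ in breakpoints]
    fs = [f for _, f in breakpoints]

    def f(t):
        if t < ts[0]:
            raise ValueError("t must be at least the first break-point")
        if t >= ts[-1]:
            return fs[-1]                    # constant beyond the last break-point
        j = bisect_right(ts, t) - 1          # ts[j] <= t < ts[j + 1]
        t1, t2 = ts[j], ts[j + 1]
        # Linear interpolation between consecutive break-points.
        return ((t2 - t) * fs[j] + (t - t1) * fs[j + 1]) / (t2 - t1)

    return f


if __name__ == "__main__":
    f = make_escape([(1, 0.1), (3, 0.5), (10, 0.8)])
    print([f(t) for t in (1, 2, 3, 10, 50)])
```

Only the break-points need to be stored, which is what keeps the specification polynomial in size even though $t$ ranges over all positive integers.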

Consider the dual of the above relaxation. The first constraint has multiplier $\lambda$, the second set of
constraints have multipliers $h_i$, and the final equality constraints have multipliers $p^i_k$. For notational
convenience, define:

$$\Delta P^i_k = \sum_{j \in S_i, j \ne k} q^i(k, j) \left( p^i_j - p^i_k \right) \qquad (2)$$

Note that $\Delta P^i_k$ is a variable that depends on the dual variables $p^i_j$. We obtain the following dual.

$$\text{Minimize} \quad \lambda + \sum_{i=1}^{n} h_i$$

$$\lambda + t h_i \ge r^i_k + f^i_k(t) \Delta P^i_k \quad \forall i,\ k \in S_i,\ t \ge 1$$

$$\lambda, h_i \ge 0 \quad \forall i$$

Since $f^i_k(t)$ is piecewise linear, for two consecutive breakpoints $t_1 < t_2$ in $W^i_k$, the constraint $\lambda + t h_i \ge r^i_k + f^i_k(t)\Delta P^i_k$ is true for all $t \in [t_1, t_2]$ iff it is true at $t_1$ and at $t_2$. This means that the constraints for $t \notin W^i_k$ are redundant. Therefore, the above dual is equivalent to the one presented in Figure 8, which we denote (Whittle-Dual).

\[
\begin{aligned}
\text{Minimize} \quad & \lambda + \sum_{i=1}^{n} h_i && \text{(Whittle-Dual)} \\
& \lambda + t h_i \ge r^i_k + f^i_k(t)\, \Delta P^i_k && \forall i, k \in S_i, t \in W^i_k \\
& \lambda, h_i \ge 0 && \forall i
\end{aligned}
\]

Figure 8: The polynomial size dual of Whittle's LP, which we denote (Whittle-Dual).



Taking the dual of the above program, we finally obtain a polynomial size relaxation for MONOTONE
bandits. Since this poly-size LP only differs from (Whittle) in restricting $t$ to lie in the relevant set $W^i_k$,
and since it will not be needed in the remaining discussion, we omit writing it explicitly.

4.3 The Balanced Linear Program 

We do not solve Whittle's relaxation. Instead, we solve the modification of (Whittle-Dual) from Figure 8, which we denote (Balance). This is shown in Figure 9. The additional constraint in (Balance) (as in Feedback MAB) is the constraint $\lambda = \sum_{i=1}^{n} h_i$.

\[
\begin{aligned}
\text{Minimize} \quad & \lambda + \sum_{i=1}^{n} h_i && \text{(Balance)} \\
& \lambda + t h_i \ge r^i_k + f^i_k(t)\, \Delta P^i_k && \forall i, k \in S_i, t \in W^i_k \\
& \lambda = \sum_{i=1}^{n} h_i \\
& \lambda, h_i \ge 0 && \forall i
\end{aligned}
\]

Figure 9: The dual linear program (Balance) for MONOTONE MAB.
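To make the constraints of (Balance) concrete, the following minimal sketch checks whether a candidate dual point is feasible. It does not solve the LP. The function names and the dictionary layout of `arms` (keys `r`, `q`, `p`, `f`) are hypothetical choices made for this illustration.

```python
def delta_p(q, p, k):
    """Equation (2): sum over j != k of q[k][j] * (p[j] - p[k])."""
    return sum(q[k][j] * (p[j] - p[k]) for j in range(len(p)) if j != k)


def balance_violations(lam, h, arms, tol=1e-9):
    """Return the list of constraints of (Balance) violated by (lam, h, p).

    arms[i] is a dict with illustrative keys:
      'r': rewards r[k]; 'q': matrix q[k][j]; 'p': dual values p[k];
      'f': list where f[k] maps each breakpoint t in W_k to f_k(t).
    """
    bad = []
    if lam < -tol or any(hi < -tol for hi in h):
        bad.append("nonnegativity")
    if abs(lam - sum(h)) > tol:
        bad.append("balance")  # lambda must equal the sum of the h_i
    for i, arm in enumerate(arms):
        for k, fk in enumerate(arm["f"]):
            dp = delta_p(arm["q"], arm["p"], k)
            for t, ft in fk.items():
                # Only breakpoints need checking; other t are redundant.
                if lam + t * h[i] < arm["r"][k] + ft * dp - tol:
                    bad.append((i, k, t))
    return bad
```

An empty return value means the point is feasible for (Balance), and hence also for (Whittle-Dual).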



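The primal linear program corresponding to (Balance) is the following (where we place an unconstrained multiplier $\omega$ on the final constraint of (Balance)):

\[
\begin{aligned}
\text{Maximize} \quad & \sum_{i=1}^{n} \sum_{k \in S_i} \sum_{t \in W^i_k} r^i_k x^i_{kt} && \text{(Primal-Balance)} \\
& \sum_{i=1}^{n} \sum_{k \in S_i} \sum_{t \in W^i_k} x^i_{kt} \le 1 - \omega \\
& \sum_{k \in S_i} \sum_{t \in W^i_k} t\, x^i_{kt} \le 1 + \omega && \forall i \\
& \sum_{j \in S_i, j \ne k} \sum_{t \in W^i_k} x^i_{kt}\, q^i(k,j)\, f^i_k(t) = \sum_{j \in S_i, j \ne k} \sum_{t \in W^i_j} x^i_{jt}\, q^i(j,k)\, f^i_j(t) && \forall i, k \in S_i \\
& x^i_{kt} \ge 0 && \forall i, k, t
\end{aligned}
\]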

The first step of the algorithm is to solve the linear program (Balance). Clearly the value of this
LP is at least OPT. We now show the following properties of the optimal solution to (Balance) using
complementary slackness conditions between (Balance) and (Primal-Balance). From now on, we
only deal with the optimal solutions to the above programs, so all variables correspond to the optimal
setting.

Lemma 4.1. Recall that OPT is the optimal value to (Whittle). Since any feasible solution to (Balance) is feasible to (Whittle-Dual), in the optimal solution to (Balance), $\lambda = \sum_{i=1}^{n} h_i \ge OPT/2$.

The next lemma is the crux of the analysis, where for any arm being played in any state, we use comple- 
mentary slackness to explicitly relate the dual variables to the reward obtained. Note that unlike the analyses 
of primal-dual algorithms, our proof needs to use both the exact primal as well as dual complementary slack- 
ness conditions. This aspect requires us to actually solve the dual optimally. 

Lemma 4.2. One of the following is true for the optimal solution to (Balance): Either there is a trivial
2-approximation by repeatedly playing the same arm; or for every arm $i$ with $h_i > 0$ and for every state
$k \in S_i$, there exists $t \in W^i_k$ such that the following LP constraint is tight with equality.

\[
\lambda + t h_i \ge r^i_k + f^i_k(t)\, \Delta P^i_k \tag{3}
\]

Proof. Note that if $\omega \le -1$ or $\omega \ge 1$, then the value of (Primal-Balance) is 0, but the optimal value of (Primal-Balance) is at least $OPT > 0$. Thus, in the optimal solution to (Primal-Balance), $\omega \in (-1, 1)$.

The optimal solutions to (Balance) and (Primal-Balance) satisfy the following complementary slackness conditions (recall from above that $\omega > -1$ so that $1 + \omega > 0$):

\[
h_i > 0 \;\Rightarrow\; \sum_{k \in S_i} \sum_{t \in W^i_k} t\, x^i_{kt} = 1 + \omega > 0 \tag{4}
\]

\[
\lambda + t h_i > r^i_k + f^i_k(t)\, \Delta P^i_k \;\Rightarrow\; x^i_{kt} = 0 \tag{5}
\]

Suppose that for some $i$ such that $h_i > 0$, and for some $k \in S_i$, we have $\lambda + t h_i > r^i_k + f^i_k(t)\Delta P^i_k$ for every $t \in W^i_k$. By Condition (5), $x^i_{kt} = 0$ for all $t \in W^i_k$, which trivially implies that $x^i_{kt} f^i_k(t) = 0$ for all $t \in W^i_k$.

Now, for this arm $i$ and state $k$, we have $x^i_{kt} f^i_k(t) = 0$ for all $t$. Therefore, in the following constraint in (Primal-Balance):

\[
\sum_{j \in S_i, j \ne k} \sum_{t \in W^i_k} x^i_{kt}\, q^i(k,j)\, f^i_k(t) = \sum_{j \in S_i, j \ne k} \sum_{t \in W^i_j} x^i_{jt}\, f^i_j(t)\, q^i(j,k)
\]

the LHS is zero because $x^i_{kt} f^i_k(t) = 0$, which means the RHS is zero. Since all variables are non-negative, this implies that for any $j \in S_i$ with $q^i(j,k) > 0$, we have $x^i_{jt} f^i_j(t) = 0$ for all $t \in W^i_j$.

Recall (from Section 4) that we assumed the graph on the states with edges from $j$ to $k$ if $q^i(j,k) > 0$ is strongly connected. Therefore, by repeating the above argument, we get $x^i_{jt} f^i_j(t) = 0$ for all $j \in S_i$ and $t \in W^i_j$.

By Condition (4), since $h_i > 0$, there exist $j \in S_i$ and $t \in W^i_j$ such that $x^i_{jt} > 0$ (or else the sum in Condition (4) is zero). By what we proved in the previous paragraph, this implies that $f^i_j(t) = 0$, which implies that $f^i_j(1) = 0$ by the MONOTONE property. Since $x^i_{jt} > 0$, using Condition (5) and plugging in $f^i_j(t) = 0$, we get $\lambda + t h_i = r^i_j$. Moreover, by plugging $f^i_j(1) = 0$ into the $t = 1$ constraint of (Balance), we get $\lambda + h_i \ge r^i_j$. These two facts imply that $\lambda + h_i = r^i_j$. Since $f^i_j(1) = 0$, an arm played at every step in state $j$ never leaves that state, so the policy that starts with arm $i$ in state $j$ and always plays this arm obtains per-step reward $\lambda + h_i \ge OPT/2$. □

In the remaining discussion, we assume that the above lemma does not find an arm $i$ that yields reward at least $OPT/2$. This means that for all $i, k$, there exists some $t \in W^i_k$ that makes Inequality (3) tight.

Lemma 4.3. For any arm $i$ such that $h_i > 0$, and state $k \in S_i$, if $\Delta P^i_k < 0$, then:

\[
\lambda + h_i = r^i_k + f^i_k(1)\, \Delta P^i_k
\]

Proof. By Lemma 4.2 and our assumption above, Inequality (3) in Lemma 4.2 is tight for some $t \in W^i_k$. If it is not tight for $t = 1$, then since $f^i_k(t)$ is non-decreasing in $t$ and since $\Delta P^i_k < 0$, it will not be tight for any $t$. Thus, we have a contradiction. □

4.4 The BalancedIndex Policy 



Start with the optimal solution to (Balance). First throw away the arms for which $h_i = 0$. By Lemma 4.1, for the remaining arms, $\sum_i h_i \ge OPT/2$. Define the following quantities for each of these arms.

Definition 6. For each $i$ ($h_i > 0$ by assumption) and state $k \in S_i$, let $t^i_k$ be the smallest value of $t \in W^i_k$ for which $\lambda + t h_i = r^i_k + f^i_k(t)\Delta P^i_k$ in the optimal solution to (Balance). By Lemma 4.2, $t^i_k$ is well-defined for every $k \in S_i$.

Definition 7. For arm $i$, partition the states $S_i$ into states $\mathcal{G}_i, \mathcal{I}_i$ as follows:

1. $k \in \mathcal{G}_i$ if $\Delta P^i_k < 0$. (By Lemma 4.3, $t^i_k = 1$.)

2. $k \in \mathcal{I}_i$ if $\Delta P^i_k \ge 0$.

With the notation above, the policy is now presented in Figure 10. In this policy, if arm $i$ has been in state $k \in \mathcal{I}_i$ for less than $t^i_k$ steps, it is defined to be "not ready" for play. Once it has waited for $t^i_k$ steps, it becomes "ready" and can be played. Moreover, if arm $i$ moves to a state $k \in \mathcal{G}_i$, it is continuously played until it moves to a state in $\mathcal{I}_i$.

Intuitively, the states in $\mathcal{G}_i$ are the "exploitation" or "good" states. On the contrary, the states in $\mathcal{I}_i$ are "exploration" or "bad" states, so the policy waits until it has a high enough probability of exiting these states before playing them. In both cases, $t^i_k$ corresponds to the "recovery time" of the state, which is 1 in a "good" state but could be large in a "bad" state.

BalancedIndex Policy

1. Exploit: Some arm $i$ moves to a state $k \in \mathcal{G}_i$:

(a) Play this arm exclusively as long as it remains in a state in $\mathcal{G}_i$.

(b) Go to Step (2).

2. Explore:

(a) Play any "ready" arm $i$ in state $k \in \mathcal{I}_i$. (If no arm is "ready", do not play at this step.)

(b) If the arm moves to a state in $\mathcal{G}_i$: go to Step (1), else go to Step (2a).

Figure 10: The BalancedIndex Policy for Monotone MAB.
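The per-step decision of the BalancedIndex policy can be sketched as a small selection function. This is an illustrative sketch only: the name `choose_arm` and its argument layout are assumptions, and the middle loop (exploiting any arm found in a good state) is an added convention for arbitrary start states, since under the policy itself only the arm just played can be in a good state.

```python
def choose_arm(state, waited, good, t_ready, current=None):
    """Pick the arm to play this step, or None to idle.

    state[i]   : state k in which arm i was last observed
    waited[i]  : steps since arm i entered that state (>= 1)
    good[i]    : set of "good" states of arm i
    t_ready[i] : dict mapping state k to its recovery time t_k
    current    : arm played at the previous step, if any
    """
    # Exploit: keep playing the current arm while it is in a good state.
    if current is not None and state[current] in good[current]:
        return current
    # Assumed convention: exploit any other arm observed in a good state.
    for i, k in enumerate(state):
        if k in good[i]:
            return i
    # Explore: play any arm that has waited out its recovery time.
    for i, k in enumerate(state):
        if waited[i] >= t_ready[i][k]:
            return i
    return None  # no arm is "ready": idle this step
```

A simulator would call this once per step, then update `state` and `waited` according to the transition probabilities of the played arm.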

Lyapunov Function Analysis. We use a Lyapunov (potential) function argument to show that the policy described in Figure 10 is a 2-approximation. Define the potential $\Phi_i$ for each arm $i$ at any time as follows. (Recall the definition of $t^i_k$ from Definition 6, as well as the quantities $\lambda, h_i$ from the optimal solution of (Balance).)

Definition 8. If arm $i$ moved to state $k \in S_i$ some $y$ steps ago ($y \ge 1$ by definition), the potential $\Phi_i$ is $p^i_k + h_i(\min(y, t^i_k) - 1)$.

Therefore, whenever the arm $i$ enters state $k$, its potential is $p^i_k$. If $k \in \mathcal{I}_i$, the potential then increases at rate $h_i$ for $t^i_k - 1$ steps, after which it remains fixed until the arm is played. Our policy plays arm $i$ only if its current potential is $p^i_k + h_i(t^i_k - 1)$.
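As a small numeric illustration of Definition 8, the sketch below evaluates the potential of one arm as its waiting time grows. The function name and the sample values are hypothetical.

```python
def potential(p_k, h_i, y, t_k):
    """Potential of an arm that entered state k exactly y >= 1 steps ago."""
    return p_k + h_i * (min(y, t_k) - 1)


# With p_k = 2.0, h_i = 0.5 and recovery time t_k = 4, the potential grows
# by h_i per step for t_k - 1 steps and then stays flat.
trajectory = [potential(2.0, 0.5, y, 4) for y in range(1, 7)]
```

The flat tail is what makes the potential bounded independently of the time horizon.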

We finally complete the analysis in the following lemma. The proof crucially uses the "balance" property of the dual, which implies that $\lambda = \sum_i h_i \ge OPT/2$. Let $\Phi_T$ denote the total potential, $\sum_{i=1}^{n} \Phi_i$, at any step $T$ and let $R_T$ denote the total reward accrued until that step. Define the function $\mathcal{L}_T = T \cdot OPT/2 - R_T - \Phi_T$. Let $\Delta R_T = R_{T+1} - R_T$ and $\Delta\Phi_T = \Phi_{T+1} - \Phi_T$.

Lemma 4.4. $\mathcal{L}_T$ is a Lyapunov function, i.e., $\mathbb{E}[\mathcal{L}_{T+1} \mid \mathcal{L}_T] \le \mathcal{L}_T$. Equivalently, at any step:

\[
\mathbb{E}[\Delta R_T + \Delta\Phi_T \mid R_T, \Phi_T] \ge OPT/2
\]

Proof. At a given step, suppose the policy does nothing; then all arms are "not ready". The total increase in potential is precisely $\Delta\Phi_T = \sum_i h_i \ge OPT/2$.

On the other hand, suppose that the policy plays arm $i$, which is currently in state $k$ and has been in that state for $y \ge t^i_k$ steps. The change in reward is $\Delta R_T = r^i_k$. Moreover, the current potential of the arm must be $\Phi_i = p^i_k + h_i(t^i_k - 1)$. The new potential follows the following distribution:

\[
\Phi_i' = \begin{cases} p^i_j, & \text{with probability } f^i_k(y)\, q^i(k,j) \quad \forall j \ne k \\ p^i_k, & \text{with probability } 1 - \sum_{j \ne k} f^i_k(y)\, q^i(k,j) \end{cases}
\]

Therefore, if arm $i$ is played, the change in potential is:

\[
\mathbb{E}[\Delta\Phi_T] = f^i_k(y) \sum_{j \ne k} q^i(k,j)\left(p^i_j - p^i_k\right) - h_i(t^i_k - 1)
\]

From the description of the Index policy, $y = t^i_k = 1$ if $k \in \mathcal{G}_i$. Therefore, $y$ might be strictly greater than $t^i_k$ only when $k \in \mathcal{I}_i$. In that case $\Delta P^i_k \ge 0$ by Definition 7, so that $f^i_k(y)\Delta P^i_k \ge f^i_k(t^i_k)\Delta P^i_k$ by the Monotone property (since $y \ge t^i_k$).






Therefore, for the arm $i$ being played, regardless of whether $k \in \mathcal{G}_i$ or $k \in \mathcal{I}_i$,

\[
\begin{aligned}
\Delta R_T + \mathbb{E}[\Delta\Phi_T] &= r^i_k + f^i_k(y)\, \Delta P^i_k - h_i(t^i_k - 1) \\
&\ge r^i_k + f^i_k(t^i_k)\, \Delta P^i_k - h_i t^i_k + h_i \\
&= \lambda + h_i \ge OPT/2
\end{aligned}
\]

where the last equality follows from the definition of $t^i_k$ (Definition 6). Since the potentials of the arms not being played do not decrease (since all $h_i > 0$), the total change in reward plus potential is at least $OPT/2$.

This completes the proof. Refer to Figure 11 for a "picture proof" when $k \in \mathcal{I}_i$. □



By their definition, the potentials $\Phi_T$ are bounded independent of the time horizon. By telescoping summation, the above lemma implies that $\lim_{T \to \infty} \mathbb{E}[R_T]/T \ge OPT/2$. We finally have:

Theorem 4.5. The BalancedIndex policy is a 2-approximation for MONOTONE bandits.
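The telescoping step behind Theorem 4.5 can be spelled out as follows, taking $R_0 = 0$ and writing $B$ for an (assumed) horizon-independent bound on $|\Phi_T|$:

```latex
\begin{aligned}
\mathbb{E}[R_T] + \mathbb{E}[\Phi_T] - \Phi_0
  &= \sum_{\tau=0}^{T-1} \mathbb{E}\left[\Delta R_\tau + \Delta\Phi_\tau\right]
   \;\ge\; T \cdot \frac{OPT}{2}, \\
\frac{\mathbb{E}[R_T]}{T}
  &\ge \frac{OPT}{2} - \frac{\mathbb{E}[\Phi_T] - \Phi_0}{T}
   \;\ge\; \frac{OPT}{2} - \frac{2B}{T}
   \;\xrightarrow[T \to \infty]{}\; \frac{OPT}{2}.
\end{aligned}
```

The first line sums the per-step guarantee of Lemma 4.4 over $T$ steps; the second divides by $T$ and uses boundedness of the potentials.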




Figure 11: Proof of Lemma 4.4. The growth of the potential $\Phi$ is shown on the lower piecewise linear
function. The upper set of curves represent the LHS and RHS of the LP constraints for state $k \in S_i$. The
tight point $t^i_k$ is where the potential switches to being constant.



4.5 Lower Bounds: Necessity of Monotonicity and Separability 

We show that Monotone bandits is NP-Hard, and that if the Monotone property is relaxed even slightly,
the problem either has an $\Omega(n)$ integrality gap for Whittle's LP, or becomes $n^{\epsilon}$-hard to approximate.

Input Specification. In the above discussion, we assumed the input to the MONOTONE bandits problem is specified by polynomial size state spaces $S_i$ for each arm, the associated matrices $q^i$, and functions $f^i_k(t)$ that are piecewise linear with poly-size specification. We can model this problem as a restless bandit problem in the sense defined in the literature by replacing each state $k \in S_i$ with exponentially many states $\{k_t, t \in \mathbb{Z}^+\}$; if the arm is not played, it transitions deterministically from state $k_t$ to $k_{t+1}$, but if played in state $k_t$, it transitions w.p. $q^i(k,j) f^i_k(t)$ to state $j_1$ for each $j \in S_i$, $j \ne k$, and with the remaining probability transitions to $k_1$. The reduction uses exponentially many states, and is unlike the typical formulation of restless bandits that assumes the state space of each arm is poly-bounded. (The PSPACE-Hardness proof for restless bandits assumes a poly-bounded state space as well.) We therefore need to use different NP-Hardness proofs for our compact input specifications.

Theorem 4.6. For the special case of the problem with $K = 2$ states per arm and $n$ arms, the following are
true even when the functions $f^i_k$ are piecewise linear with poly-size specification:

1. Computing the optimal ergodic policy for MONOTONE bandits is NP-Hard.

2. If the Monotone property is relaxed to allow arbitrary (possibly non-monotone) functions $f^i_k$, then
the problem becomes $n^{\epsilon}$-hard to approximate unless P = NP.

Proof. We reduce from the following periodic scheduling problem, which is shown to be NP-Complete in [9]: Given $n$ positive integers $l_1, l_2, \ldots, l_n$ such that $\sum_{i=1}^{n} 1/l_i \le 1$, is there an infinite sequence of integers from $\{1, 2, \ldots, n\}$ such that for every $i \in \{1, 2, \ldots, n\}$, all consecutive occurrences of $i$ are exactly $l_i$ elements apart? Given an instance of this problem, for each $i \in \{1, 2, \ldots, n\}$, we define an arm $i$ with a "good" state $g$ and a "bad" state $w$.

For part 1, for every arm $i$, let $r^i_g = 1$ and $r^i_w = 0$. Set $q^i(g, w) = 1$ and $f^i_g(t) = 1$ for all $t$. Moreover, set $q^i(w, g) = 1$, and $f^i_w(t) = 0$ if $t \le 2l_i - 2$ and 1 otherwise. Suppose for a moment that we only have arm $i$; then the optimal policy will play the arm exactly $2l_i - 1$ steps after it is observed to be in $w$, and the arm will transition to state $g$. The policy will then play the arm in state $g$ to obtain reward 1, and the arm will transition back to state $w$. Since this policy is periodic with period $2l_i$, it yields long-term average reward exactly $\frac{1}{2l_i}$. It is easy to see that any other ergodic policy of playing this arm yields strictly smaller reward per step. Any policy of playing all the arms therefore has total reward of at most $\sum_{i=1}^{n} \frac{1}{2l_i}$. Moreover, for any ergodic policy, the reward of $\sum_{i=1}^{n} \frac{1}{2l_i}$ is achievable only if each arm $i$ is played according to its individual optimal policy, which is twice in succession every $2l_i$ steps. But deciding whether this is possible is equivalent to solving the periodic scheduling problem on the $l_i$. Therefore, deciding whether the optimal policy to the MONOTONE bandit problem yields reward $\sum_{i=1}^{n} \frac{1}{2l_i}$ is NP-Hard.

For part 2, we make $w$ a trapping state with no reward. For arm $i$, set $q^i(g, w) = q^i(w, g) = 1$; and $f^i_g(l_i) = 0$ and $f^i_g(t) = 1$ for all $t \ne l_i$. Furthermore, $f^i_w(t) = 0$ for all $t$. Also set $r^i_g = l_i$ and $r^i_w = 0$. Therefore, for any arm $i$, any policy will obtain reward from this arm if and only if it chooses the start state to be $g$, and plays the arm periodically once every $l_i$ steps to obtain average reward 1. Therefore, approximating the value of the optimal policy is the same as approximating the size of the largest subset of $\{l_1, l_2, \ldots, l_n\}$ so that this subset induces a periodic schedule. The NP-Hardness proof of periodic scheduling in [9] shows that this problem is as hard as approximating the size of the largest subset of vertices in a graph whose induced subgraph is bipartite, which is $n^{\epsilon}$-hard to approximate [38] unless P = NP. □
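The periodic scheduling problem used in the reduction can be explored on tiny instances with the brute-force sketch below. It handles only the underlying scheduling question (one slot per occurrence), not the two-slot variant induced by the bandit instance, and the function names are hypothetical. It relies on the standard fact that two arithmetic progressions $a_i + l_i\mathbb{Z}$ and $a_j + l_j\mathbb{Z}$ intersect exactly when $a_i \equiv a_j \pmod{\gcd(l_i, l_j)}$.

```python
from itertools import product
from math import gcd


def collision_free(periods, offsets):
    """True if no two items ever claim the same time slot."""
    n = len(periods)
    for i in range(n):
        for j in range(i + 1, n):
            g = gcd(periods[i], periods[j])
            # Progressions meet iff offsets agree modulo the gcd.
            if (offsets[i] - offsets[j]) % g == 0:
                return False
    return True


def find_offsets(periods):
    """Brute-force search for start offsets; exponential in general."""
    for offsets in product(*(range(l) for l in periods)):
        if collision_free(periods, offsets):
            return offsets
    return None
```

For example, periods `[2, 3]` satisfy $\sum_i 1/l_i \le 1$ yet admit no schedule because the periods are coprime, while `[2, 4, 4]` can be scheduled.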

In the above proof, we showed that the problem becomes hard to approximate if the transition probabilities are non-monotone. However, that does not address the question of how far we can push our technique. We give a negative result by showing that Whittle's LP can have arbitrarily large gap even if the MONOTONE bandit problem is slightly generalized by preserving the monotone nature of the transition probabilities, but removing the additional separable structure that they should be of the form $f^i_k(t) q^i(k,j)$. In other words, the transition probability from state $k$ to state $j \ne k$ if played after $t$ steps is $q^i_{kj}(t)$; these are arbitrary monotonically non-decreasing functions of $t$. We insist $\sum_{j \ne k} q^i_{kj}(t) \le 1$ for all $k, t$ to ensure feasibility. We show that Whittle's LP has $\Omega(n)$ gap for this generalization.

Theorem 4.7. If the separability assumption on transition probabilities is relaxed, Whittle's LP has $\Omega(n)$ gap even with $K = 3$ states per arm.

Proof. The arms are all identical. Each has 3 states, $\{g, b, a\}$. State $a$ is an absorbing state with 0 reward. State $g$ has reward 1, and state $b$ has reward 0. The transition probabilities are as follows: $q_{ab}(t) = q_{ag}(t) = 0$. Further, $q_{gb}(t) = 1/2$, $q_{ga}(1) = 0$; and $q_{ga}(t) = 1/2$ for $t \ge 2$. Finally, $q_{ba}(t) = q_{bg}(t) = 0$ for $t < 2n - 1$; $q_{bg}(2n - 1) = 1/2$; $q_{ba}(2n - 1) = 0$; and $q_{ba}(t) = q_{bg}(t) = 1/2$ for $t \ge 2n$.

A feasible single arm policy involves playing the arm in state $b$ after exactly $2n - 1$ steps (w.p. 1/2, the state transitions to $g$), and continuously in state $g$ (w.p. 1/2, the state transitions to $b$). This policy never enters state $a$. The average rate of play is $1/n$. The per-step reward of this policy is $\Theta(1/n)$. Whittle's LP chooses this policy for each arm so that the total rate of play is 1 and the objective is $\Theta(1)$.

Now consider any feasible policy that plays at least 2 arms. If one of these arms is in state g, there is 
a non-zero probability that either this arm is played after t > 1 steps, or the other arm in state b is played 
after t > 2n steps. In either case, w.p. 1/2, the arm enters absorbing state. Since this is an infinite horizon 
problem, the above event happens w.p. 1. Therefore, any feasible policy is restricted to playing only one 
arm in the long run, and obtains reward at most 1/n. □ 



5 Monotone Bandits: Multiple Simultaneous Plays of Varying Duration 

In this section, we extend the index policy for MONOTONE bandits to handle multiple plays of varying duration. We use the same problem description as in Section 4, except we assume there are $M \ge 1$ players, each of which can play one arm every time step. (Therefore, $M$ plays can proceed simultaneously per step.)

Furthermore, we assume that if arm $i$ in state $k \in S_i$ is played, this play takes $L^i_k \ge 1$ steps and during this time, this player cannot play another arm. We note that the values $L^i_k$ are fixed beforehand, and the players are aware of these values. When the player plays arm $i$ in state $k$, he/she is forced to remain on arm $i$ for $L^i_k$ steps, and he/she only receives one reward of magnitude $r^i_k$ at the beginning of this "blocking" period.

Suppose when the current play begins, the previous play had ended $t \ge 1$ steps ago. Then, at the end of the current play, the arm transitions to one of the states $j \ne k$ w.p. $q^i(k,j) f^i_k(t)$, and with the remaining probability stays in state $k$. In Section 4, we focused on the case where $M = 1$ and all $L^i_k = 1$.

Since the overall algorithm and analysis are very similar to that in Section 4, we simply outline the differences. First, Whittle's LP gets modified as follows:

\[
\begin{aligned}
\text{Maximize} \quad & \sum_{i=1}^{n} \sum_{k \in S_i} \sum_{t \ge 1} r^i_k x^i_{kt} && \text{(Whittle)} \\
& \sum_{i=1}^{n} \sum_{k \in S_i} \sum_{t \ge 1} L^i_k\, x^i_{kt} \le M \\
& \sum_{k \in S_i} \sum_{t \ge 1} (t + L^i_k - 1)\, x^i_{kt} \le 1 && \forall i \\
& \sum_{j \in S_i, j \ne k} \sum_{t \ge 1} x^i_{kt}\, q^i(k,j)\, f^i_k(t) = \sum_{j \in S_i, j \ne k} \sum_{t \ge 1} x^i_{jt}\, q^i(j,k)\, f^i_j(t) && \forall i, k \in S_i \\
& x^i_{kt} \ge 0 && \forall i, k \in S_i, t \ge 1
\end{aligned}
\]

In the above formulation, the first constraint merely encodes that in expectation $M$ arms are played per step. Note that each play of arm $i$ in state $k$ lasts $L^i_k$ steps, and the play begins with probability $x^i_{kt}$, so that the steady state probability that arm $i$ in state $k$ is being played at any time step is $\sum_{t \ge 1} L^i_k x^i_{kt}$. Note now that if the play begins after $t$ steps, then the arm was idle for $t - 1$ steps before this event. Therefore, the quantity $\sum_{t \ge 1} (t + L^i_k - 1) x^i_{kt}$ would be the steady state probability that the arm $i$ is in state $k$. This summed over all $k$ must be at most 1 for any arm $i$. This is the second constraint. The final constraint encodes that the rate of leaving state $k$ in steady state (LHS) must be the same as the rate of entering state $k$ (RHS).



5.1 Balanced Program and Complementary Slackness 

The balanced linear program is in Fig. 12. (Recall the definition of $\Delta P^i_k(t)$ from Equation (2).) Next, Lemma 4.2 gets modified as follows:

\[
\begin{aligned}
\text{Minimize} \quad & M\lambda + \sum_{i=1}^{n} h_i && \text{(Balance)} \\
& L^i_k(\lambda + h_i) + (t - 1) h_i \ge r^i_k + f^i_k(t)\, \Delta P^i_k(t) && \forall i, k \in S_i, t \in W^i_k \\
& M\lambda = \sum_{i=1}^{n} h_i \\
& \lambda, h_i \ge 0 && \forall i
\end{aligned}
\]

Figure 12: The linear program (Balance) that we actually solve.

Lemma 5.1. In the optimal solution to (Balance), one of the following is true for every arm $i$ with $h_i > 0$: Either repeatedly playing the arm yields per-step reward at least $\lambda + h_i$; or for every state $k \in S_i$, there exists $t \in W^i_k$ such that the following LP constraint is tight with equality.

\[
L^i_k(\lambda + h_i) + (t - 1) h_i \ge r^i_k + f^i_k(t)\, \Delta P^i_k(t) \tag{6}
\]

We next split the arms into two types:

Definition 9. 1. Arm $i \in U_1$ if repeatedly playing it yields average per-step reward at least $\lambda + h_i$. Our policy described in the next section favors these arms and continuously plays them.

2. Arm $i \in U_2$ if $i \notin U_1$ and $h_i > 0$. Note that for $i \in U_2$, for all $k$ there exists $t \in W^i_k$ that makes Inequality (6) tight.

Lemma 5.2. For any arm $i \in U_2$ and state $k \in S_i$, if $\Delta P^i_k(t) < 0$, then:

\[
L^i_k(\lambda + h_i) = r^i_k + f^i_k(1)\, \Delta P^i_k(t)
\]

5.2 BalancedIndex Policy 

Definition 10. For each $i$ and state $k \in S_i$, let $t^i_k$ be the smallest value of $t \in W^i_k$ for which Inequality (6) is tight. By Lemma 4.2, $t^i_k$ is well-defined for every $k \in S_i$.

Definition 11. For arm $i \in U_2$, partition the states $S_i$ into states $\mathcal{G}_i, \mathcal{I}_i$ as follows:

1. $k \in \mathcal{G}_i$ if $\Delta P^i_k(t) < 0$. (By Lemma 5.2, $t^i_k = 1$.)

2. $k \in \mathcal{I}_i$ if $\Delta P^i_k(t) \ge 0$.

Finally, the BalancedIndex policy is described in Figure 13. Note that any arm $i \in U_2$ that is observed to be in a state in $\mathcal{G}_i$ is continuously played until its state transitions into $\mathcal{I}_i$. This preserves the invariant that at most $M - |U_1|$ arms $i \in U_2$ are in states $k \in \mathcal{G}_i$ at any time step.



5.3 Lyapunov Function Analysis 

Define the potential for each arm in $U_2$ at any time as follows.

Definition 12. If arm $i \in U_2$ moved to state $k \in S_i$ some $y$ steps ago ($y \ge 1$ by definition), the potential is $p^i_k + h_i(\min(y, t^i_k) - 1)$.

Therefore, whenever the arm $i \in U_2$ enters state $k$, its potential is $p^i_k$. If $k \in \mathcal{I}_i$, the potential then increases at rate $h_i$ for $t^i_k - 1$ steps, after which it remains fixed until a play completes for it. When our policy decides to play arm $i \in U_2$, its current potential is $p^i_k + h_i(t^i_k - 1)$.

We finally complete the analysis in the following lemma. The proof crucially uses the "balance" property of the dual, which states that $M\lambda = \sum_i h_i \ge OPT/2$. Let $\Phi_T$ denote the total potential at any step $T$ and let $R_T$ denote the total reward accrued until that step. Define the function $\mathcal{L}_T = T \cdot OPT/2 - R_T - \Phi_T$. Let $\Delta R_T = R_{T+1} - R_T$ and $\Delta\Phi_T = \Phi_{T+1} - \Phi_T$.

BalancedIndex Policy at any time step

Continuously play $\min(M, |U_1|)$ arms in $U_1$. /* Execute the remaining policy only if $|U_1| < M$ */

Invariant: At most $M - |U_1|$ arms $i \in U_2$ are in states $k \in \mathcal{G}_i$.

If a player becomes free, prioritize the available arms in $U_2$ as follows:

(a) Exploit: Choose to play an arm $i$ in state $k \in \mathcal{G}_i$.

(b) Explore: If no "good" arm is available, play any "ready" arm $i$ in state $k \in \mathcal{I}_i$.

(c) Idle: If no "good" or "ready" arm is available, then idle.

Figure 13: The Index Policy.



Lemma 5.3. $\mathcal{L}_T$ is a Lyapunov function, i.e., $\mathbb{E}[\mathcal{L}_{T+1} \mid \mathcal{L}_T] \le \mathcal{L}_T$. Equivalently, at any step:

\[
\mathbb{E}[\Delta R_T + \Delta\Phi_T \mid R_T, \Phi_T] \ge OPT/2
\]

Proof. Arms $i \in U_1$ are played continuously and yield average per-step reward $\lambda + h_i$, so that for any such arm $i$ being played, $\mathbb{E}[\Delta R_T] = \lambda + h_i$.

Next focus on arms $i \in U_2$. As before, it is easy to show that when played, regardless of whether $k \in \mathcal{G}_i$ or $k \in \mathcal{I}_i$,

\[
\begin{aligned}
\Delta R_T + \mathbb{E}[\Delta\Phi_T] &= r^i_k + f^i_k(y) \sum_{j \ne k} q^i(k,j)\left(p^i_j - p^i_k\right) - h_i(t^i_k - 1) \\
&\ge r^i_k + f^i_k(t^i_k)\, \Delta P^i_k(t) - h_i(t^i_k - 1) = L^i_k(\lambda + h_i)
\end{aligned}
\]

where the last equality follows from the definition of $t^i_k$ (Definition 10). Since the play lasts $L^i_k$ time steps, the amortized per-step change for the duration of the play, $\Delta R_T + \mathbb{E}[\Delta\Phi_T]$, is at least $\lambda + h_i$.

We finally bound the increase in reward plus potential at any time step. At step $T$, let $S_g$ denote the arms in $U_1$ and those in $U_2$ in states $k \in \mathcal{G}_i$. Let $S_r$ denote the "ready" arms in states $k \in \mathcal{I}_i$, and let $S_n$ denote the set of arms that are not "ready". There are two cases. If $|S_g \cup S_r| \ge M$, then some $S_p \subseteq S_g \cup S_r$ with $|S_p| = M$ is being played.

\[
\Delta R_T + \mathbb{E}[\Delta\Phi_T] \ge \sum_{i \in S_p} (\lambda + h_i) \ge M\lambda \ge OPT/2
\]

Next, if $|S_g \cup S_r| < M$, then all these arms are being played.

\[
\Delta R_T + \mathbb{E}[\Delta\Phi_T] = \sum_{i \in S_g \cup S_r} (\lambda + h_i) + \sum_{i \in S_n} h_i \ge \sum_{i \in S_g \cup S_r \cup S_n} h_i = \sum_i h_i \ge OPT/2
\]

Since the potentials of the arms not being played do not decrease (since all $h_i > 0$), the total change in reward plus potential is at least $OPT/2$. □

Theorem 5.4. The BalancedIndex policy in Figure 13 is a 2-approximation for MONOTONE bandits with multiple simultaneous plays of variable duration.
6 Monotone Bandits: Switching Costs



In several scenarios, playing an arm continuously incurs no extra cost, but switching to a different arm incurs a closing cost for the old arm and a setup cost for the new arm. For the applications mentioned in Section 1, in the context of UAV navigation [40], this is the cost of moving the UAV to the new location; or in the case of wireless channel selection, this is the setup cost of transmitting on the new channel.

We now show a 2-approximation for MONOTONE Bandits when the cost of switching out of arm $i$ is $c_i$ and the cost of switching into arm $i$ is $s_i$. This cost is subtracted from the reward. Note that the switching cost depends additively on the closing and setup costs of the old and new arms. The remaining formulation is the same as Section 4.

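Since the overall policy and proof are very similar to the version without these costs, we only outline the differences. First, we define the following variables: Let $x^i_{kt}$ denote the probability of the event that arm $i$ in state $k$ is played after $t$ steps and this arm was switched into from a different arm. Let $y^i_{kt}$ denote the equivalent probability when the previous play was for the same arm. The LP relaxation is as follows:

\[
\begin{aligned}
\text{Maximize} \quad & \sum_{i=1}^{n} \sum_{k \in S_i} \sum_{t \in W^i_k} \left( r^i_k \left(x^i_{kt} + y^i_{kt}\right) - (c_i + s_i)\, x^i_{kt} \right) && \text{(LPSwitch)} \\
& \sum_{i=1}^{n} \sum_{k \in S_i} \sum_{t \in W^i_k} \left( x^i_{kt} + t\, y^i_{kt} \right) \le 1 \\
& \sum_{k \in S_i} \sum_{t \in W^i_k} t \left( x^i_{kt} + y^i_{kt} \right) \le 1 && \forall i
\end{aligned}
\]

The balanced dual is the following. (Recall the definition of $\Delta P^i_k(t)$ from Equation (2).)

\[
\begin{aligned}
\text{Minimize} \quad & \lambda + \sum_{i=1}^{n} h_i && \text{(DualSwitch)} \\
& \lambda + t h_i \ge r^i_k - c_i - s_i + f^i_k(t)\, \Delta P^i_k(t) && \forall i, k, t \\
& t(\lambda + h_i) \ge r^i_k + f^i_k(t)\, \Delta P^i_k(t) && \forall i, k, t \\
& \lambda, h_i \ge 0 && \forall i
\end{aligned}
\]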


The proof of the next claim follows from complementary slackness exactly as the proof of Lemma 4.2.

Lemma 6.1. In the optimal solution to (DualSwitch), one of the following is true for every arm $i$ with $h_i > 0$: Either repeatedly playing the arm yields per-step reward at least $\lambda + h_i$; or for every state $k \in S_i$, there exists $t \in W^i_k$ such that one of the following two LP constraints is tight with equality:

1. $\lambda + t h_i \ge r^i_k - c_i - s_i + f^i_k(t)\Delta P^i_k(t)$.

2. $t(\lambda + h_i) \ge r^i_k + f^i_k(t)\Delta P^i_k(t)$.

Only consider arms with $h_i > 0$. The next lemma is similar to Lemma 4.3.

Lemma 6.2. For any arm $i$ and state $k \in S_i$, if $\Delta P^i_k(t) < 0$, then: $\lambda + h_i = r^i_k + f^i_k(1)\Delta P^i_k(t)$.

For arm $i$, let $t^i_k$ denote the smallest $t$ for which some dual constraint for state $k$ (refer to Lemma 6.1) is tight. The state $k \in S_i$ belongs to $\mathcal{G}_i$ if the second constraint in Lemma 6.1 is tight at $t = t^i_k$, i.e.:

\[
t^i_k(\lambda + h_i) = r^i_k + f^i_k(t^i_k)\, \Delta P^i_k(t) \tag{7}
\]

By Lemma 6.2, this includes the case where $\Delta P^i_k(t) < 0$, so that $t^i_k = 1$.

Otherwise, the first constraint in Lemma 6.1 is tight at $t = t^i_k$. This state $k$ belongs to $\mathcal{I}_i$, and becomes "ready" after $t^i_k$ steps.

With these definitions, the BalancedIndex policy is as follows: Stick with an arm $i$ as long as its state is some $k \in \mathcal{G}_i$, and play it after waiting $t^i_k - 1$ steps. Otherwise, play any "ready" arm. If no "good" or "ready" arm is available, then idle.

Theorem 6.3. The BalancedIndex policy is a 2-approximation for Monotone bandits with switching costs.

Proof. The definitions of the potentials and the proof are the same as in Lemma 4.4. The only difference is that the potential of state $k \in \mathcal{G}_i$ is defined to be fixed at $p^i_k$. Whenever the player sticks to arm $i$ in state $k \in \mathcal{G}_i$ and plays it after waiting $t^i_k - 1$ steps, the reward plus change in potential amortized over the $t^i_k$ steps (waiting plus playing) is exactly $\lambda + h_i$ by Eq. (7). The rest of the proof is the same as before. □



7 Feedback MAB with Observation Costs 

In wireless channel scheduling, the state of a channel can be accurately determined by sending probe packets 
that consume energy. However, data transmission at high bit-rate yields only delayed feedback about channel 
quality. This aspect can be modeled by decoupling observation about the state of the arm via probing, from 
the process of utilizing or playing the arm to gather reward (data transmission). We model this as a variant 
of the Feedback MAB problem, where at any step, $M$ arms can be played without observing their states, and
the reward of the underlying state is deposited in a bank. Further, any arm can be probed by paying a cost
to determine its underlying state, and multiple such probes are allowed per step. The goal is to maximize
the difference between the time average reward and probing cost. A version of the probe problem was first
proposed in a preliminary draft of [27].

Formally, we consider the following variant of the FEEDBACK MAB problem. As before, the underlying 2-state Markov chain (on states $\{g, b\}$) corresponding to an arm evolves irrespective of whether the arm is played or not. When arm $i$ is played, a reward of $r_i$ or 0 (depending on whether the underlying state is $g$ or $b$ respectively) is deposited into a bank. Unlike the FEEDBACK MAB problem, the player does not get to know the reward value or the state of the arm. However, at the end of any time step, the player can probe any arm $i$ by paying cost $c_i$ to observe its underlying state. We assume that the probes are at the end of a time step, and the state evolves between the probe and the start of the next time step. More than one arm can be probed and observed in any time step, but at most $M$ arms can be played, and the plays are of unit duration. The goal as before is to maximize the infinite horizon time-average difference between the reward obtained from playing the arms and the probing cost spent. Denote the difference between the reward and the probing cost as the "value" of the policy.

Though the probe version is not a MONOTONE bandit problem, we show that the above techniques can indeed be used to construct a policy which yields a $(2 + \epsilon)$-approximation for any fixed $\epsilon > 0$.



7.1 LP Formulation 

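Let OPT denote the value of the optimal policy. The following is now an LP relaxation for the optimal policy. Let $x^i_{gt}$ (resp. $x^i_{bt}$) denote the probability that arm $i$ was last observed to be in state $g$ (resp. $b$) $t$ time steps ago and played at the current time step. Let $z^i_{gt}$ (resp. $z^i_{bt}$) denote the probability that arm $i$ was last observed to be in state $g$ (resp. $b$) $t$ steps ago and is probed at the current time step. The probes are at the end of a time step, and the state evolves between the probe and the start of the next time step. The LP formulation is as follows; as before, the LP can be solved up to a $1 + \epsilon$ factor.

\[
\begin{aligned}
\text{Maximize} \quad & \sum_{i=1}^{n} \sum_{t \ge 1} \left( r_i \left(u_{it} x^i_{gt} + v_{it} x^i_{bt}\right) - c_i \left(z^i_{gt} + z^i_{bt}\right) \right) \\
& \sum_{i=1}^{n} \sum_{t \ge 1} \left( x^i_{gt} + x^i_{bt} \right) \le M \\
& \sum_{t \ge 1} t \left( z^i_{gt} + z^i_{bt} \right) \le 1 && \forall i \\
& x^i_{st} \le \sum_{l \ge t} z^i_{sl} && \forall i, t \ge 1, s \in \{g, b\} \\
& \sum_{t \ge 1} (1 - u_{it})\, z^i_{gt} = \sum_{t \ge 1} v_{it}\, z^i_{bt} && \forall i \\
& x^i_{st}, z^i_{st} \ge 0 && \forall i, t \ge 1, s \in \{g, b\}
\end{aligned}
\]

The dual assigns a variable $\phi^i_{st} \ge 0$ for each arm $i$, state $s \in \{g, b\}$, and last observed time $t \ge 1$. It further assigns variables $h_i, p_i \ge 0$ per arm $i$, and $\lambda \ge 0$ globally. Let $R^i_{st}$ be the expected reward of playing arm $i$ in state $s$ when last observed time is $t$. ($R^i_{gt} = r_i u_{it}$, $R^i_{bt} = r_i v_{it}$.) The balanced dual is as follows:

\[
\begin{aligned}
\text{Minimize} \quad & M\lambda + \sum_{i} h_i \\
& \lambda + \phi^i_{st} \ge R^i_{st} && \forall i, t \ge 1, s \in \{g, b\} \\
& t h_i \ge -c_i - (1 - u_{it})\, p_i + \sum_{l \le t} \phi^i_{gl} && \forall i, t \ge 1 \\
& t h_i \ge -c_i + v_{it}\, p_i + \sum_{l \le t} \phi^i_{bl} && \forall i, t \ge 1 \\
& M\lambda = \sum_{i} h_i \\
& \lambda, h_i, p_i, \phi^i_{st} \ge 0 && \forall i, s \in \{g, b\}
\end{aligned}
\]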

We omit explicitly writing the corresponding primal. Note now that in the dual optimal solution, cpl^ = 
max(0, A), s G {g,b}. (This is the smallest value of satisfying the first constraint, and whenever we 
reduce we preserve the latter constraints while possibly reducing hi.) Moreover, we have the following 
complementary slackness conditions: 

1. ^,>o^Et>ii(4 + 4t) = i + ^>o- 

2. 4t > ^ = -Ci - (1 - Uit)pi + J2i<t ^ii- 

3. 4^ > ^ thi = -a + VitPi + J2i<t 4>li 

Lemma 7.1. Focus only on arms for which $h_i > 0$. For these arms, we have the following.

1. For at least one $t \ge 1$, $z^i_{gt} > 0$, and similarly, for some (possibly different) $t$, $z^i_{bt} > 0$.

2. Let $d_i = \min\{t \ge 1 : z^i_{bt} > 0\}$; then $d_i h_i = -c_i + v_{id_i} p_i + \sum_{l \le d_i} \phi^i_{bl}$. Further, define $m_i = |\{l \le d_i : \phi^i_{bl} > 0\}|$;
then $\phi^i_{bl} > 0$ for $d_i - m_i + 1 \le l \le d_i$ and $\phi^i_{bl} = 0$ for $l \le d_i - m_i$.

3. Let $e_i = \min\{t \ge 1 : z^i_{gt} > 0\}$. Then, for all $t \le e_i$, $\lambda + \phi^i_{gt} = R^i_{gt}$. Moreover, $e_i(\lambda + h_i) =
\sum_{t \le e_i} R^i_{gt} - c_i - (1 - u_{ie_i}) p_i$.

Proof. For part (1), by complementary slackness and using $h_i > 0$, we have $\sum_{t \ge 1} t (z^i_{gt} + z^i_{bt}) = 1 > 0$. If
$z^i_{gt} > 0$ for some $t$, then by $\sum_{t \ge 1} (1 - u_{it}) z^i_{gt} = \sum_{t \ge 1} v_{it} z^i_{bt}$, we have $z^i_{bt} > 0$ for some (possibly different)
$t$. The reverse holds as well.

Part (2) follows by complementary slackness on $z^i_{bd_i} > 0$. The second part follows from the fact that
$\phi^i_{bl}$ is non-decreasing in $l$ since $R^i_{bl}$ is non-decreasing.

For part (3), since $z^i_{ge_i} > 0$, by complementary slackness, $e_i h_i = -c_i - (1 - u_{ie_i}) p_i + \sum_{t \le e_i} \phi^i_{gt}$. Recall
that $\phi^i_{gt} = \max(0, R^i_{gt} - \lambda)$. If $e_i = 1$, then since the LHS is positive, it must be that $\phi^i_{g1} > 0$, which implies
that $\phi^i_{g1} = R^i_{g1} - \lambda$. If $e_i > 1$, then we subtract $(e_i - 1) h_i \ge -c_i - (1 - u_{i(e_i - 1)}) p_i + \sum_{t \le e_i - 1} \phi^i_{gt}$ from the
equality and get $h_i \le (u_{ie_i} - u_{i(e_i - 1)}) p_i + \phi^i_{ge_i}$. The LHS is positive and the first term of the RHS is non-positive,
so $\phi^i_{ge_i} > 0$. Since by the above formula $\phi^i_{gt}$ is non-increasing in $t$, $\phi^i_{gt} > 0$ for all $t \le e_i$. This in turn implies that
$\phi^i_{gt} = R^i_{gt} - \lambda$ for all $t \le e_i$. Substituting this back into the equality yields the second result. □
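
The parameters $d_i$, $m_i$ and $e_i$ of Lemma 7.1 can be read off directly from an LP solution and its dual. The sketch below is one way to do this for a single arm; the list-based representation (entry `t - 1` holds the value for time `t`) and the sample values are assumptions made for illustration, and a real implementation would compare against a small numerical tolerance rather than zero.

```python
def lemma_parameters(z_b, z_g, phi_b):
    """Return (d, m, e) for one arm with h_i > 0; list entry t-1 corresponds to time t."""
    # d: first time t at which z_bt > 0; e: first time t at which z_gt > 0.
    d = next(t for t, z in enumerate(z_b, start=1) if z > 0)
    e = next(t for t, z in enumerate(z_g, start=1) if z > 0)
    # m: number of strictly positive potentials phi_bl with l <= d.
    m = sum(1 for value in phi_b[:d] if value > 0)
    return d, m, e


# Hypothetical primal and dual values for a single arm.
z_b = [0.0, 0.0, 0.0, 0.05]
z_g = [0.0, 0.1]
phi_b = [0.0, 0.0, 0.2, 0.4]
d, m, e = lemma_parameters(z_b, z_g, phi_b)
```

With these sample values the arm has $d_i = 4$, $m_i = 2$ and $e_i = 2$.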
7.2 Index Policy

Let the set of arms with $h_i > 0$ be $S$; we ignore all arms except those in $S$. The policy uses the parameters
$e_i$, $d_i$ and $m_i$ defined in Lemma 7.1. If arm $i$ was observed to be in state $b$, we denote it "not ready" for the
next $d_i - m_i$ steps, and denote it to be "ready" at the end of the $(d_i - m_i)^{th}$ step.



Constraint:
    The policy keeps initiating "Stage 1" with ready arms if fewer than $M$ arms are in either stage.

Stage 1: /* Arm $i$ is last observed to be $b$ but has turned "ready" */
    1. Try. Play the arm for the next $m_i$ steps.
    2. Probe the arm at the end of the $m_i^{th}$ step.
       If state is $g$, then go to Stage 2 for that arm.

Stage 2: /* Arm $i$ just observed to be in state $g$ */
    1. Exploit. Play the arm for the next $e_i$ steps.
    2. Probe the arm at the end of the $e_i^{th}$ step.
       If state is $g$ then go to Step (1) of Stage (2).

Figure 14: The policy for FEEDBACK MAB with observations.
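
To make the two stages concrete, the following sketch traces the actions of the policy of Figure 14 for a single arm, starting just after the arm was observed in state $b$. It is a simplification under stated assumptions: the constraint on $M$ is ignored (a ready arm is assumed to enter Stage 1 immediately), probes are listed as separate trace entries although they occur at the end of the last play step, and the function name `arm_schedule` and the supplied probe outcomes are hypothetical.

```python
def arm_schedule(d, m, e, probe_outcomes):
    """Action trace for one arm; probe_outcomes lists the states seen at successive probes."""
    trace = []
    stage = "wait"  # the arm was just observed in state b
    for outcome in probe_outcomes:
        if stage == "wait":
            trace += ["idle"] * (d - m)  # "not ready" for d - m steps
            trace += ["play"] * m        # Stage 1: try for m steps
        else:
            trace += ["play"] * e        # Stage 2: exploit for e steps
        trace.append("probe->" + outcome)
        # Observing g leads to Stage 2; observing b makes the arm "not ready" again.
        stage = "exploit" if outcome == "g" else "wait"
    return trace


trace = arm_schedule(d=3, m=1, e=2, probe_outcomes=["g", "b"])
```

Here the arm idles for $d_i - m_i = 2$ steps, is played once and probed; since the probe reveals $g$, it is exploited for $e_i = 2$ steps and probed again.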



Theorem 7.2. The policy in Fig. 14 is a $(2 + \epsilon)$-approximation to FEEDBACK MAB with observation costs.



Proof. Let OPT denote the $(1 + \epsilon)$-approximate LP solution. Recall that $\phi^i_{st} = \max(0, R^i_{st} - \lambda)$, $s \in \{g, b\}$.

Define the following potentials for each arm $i$. If it was last observed to be in state $b$ some $t$ steps ago,
define its potential to be $\min(t, d_i - m_i) h_i$; if it was last observed in state $g$, define its potential to be $p_i$.
We show that the time-average expected value (reward minus cost) plus change in potential per step is at
least $\min(M\lambda, \sum_{i \in S} h_i) \ge OPT/2$. Since the potentials are bounded, this proves a 2-approximation.

Each ready arm in Stage 1 is played for $m_i$ steps and probed at the end of the $m_i^{th}$ step. Suppose that
the arm was last observed to be in state $b$ some $t$ steps ago. The total expected value is $-c_i + \sum_{l=t}^{t+m_i-1} R^i_{bl}$,
which is at least $-c_i + \sum_{l=d_i-m_i+1}^{d_i} R^i_{bl}$ since $R^i_{bl} = r_i v_{il}$ is non-decreasing in $l$. The expected change in
potential is $v_{i(t+m_i-1)} p_i - (d_i - m_i) h_i$, since the arm loses the potential build-up of $(d_i - m_i) h_i$ while it was
not ready, and has a probability of $v_{i(t+m_i-1)}$ of becoming good. This is at least $v_{id_i} p_i - (d_i - m_i) h_i$ since
by definition, $t \ge d_i - m_i + 1$. After $m_i$ steps, the total expected value plus change in potential is at least
$-c_i + \sum_{l=d_i-m_i+1}^{d_i} R^i_{bl} + v_{id_i} p_i - (d_i - m_i) h_i \ge m_i h_i + \sum_{l=d_i-m_i+1}^{d_i} (R^i_{bl} - \phi^i_{bl})$. The inequality follows
by Lemma 7.1 Part (2). Since $R^i_{bl} - \phi^i_{bl} = \lambda$ for $d_i - m_i + 1 \le l \le d_i$, the total expected change in value
plus potential is at least $m_i(\lambda + h_i)$. Thus, the average per step for the duration of the plays is at least $\lambda + h_i$. (This
proof also shows that if $m_i = 0$, then the probing on the previous step does not decrease the potential.)

Similarly, each arm $i$ in Stage 2 was probed and found to be good, so that it is exploited for $e_i$ steps
and probed at the end of the $e_i^{th}$ step. During these steps, the total expected value is $\sum_{t \le e_i} R^i_{gt} - c_i$, and the
expected change in potential is $-(1 - u_{ie_i}) p_i$, since the arm has probability $(1 - u_{ie_i})$ of being in a bad state
at the end. By Lemma 7.1 Part (3), the total expected value plus change in potential is $\sum_{t \le e_i} R^i_{gt} - c_i -
(1 - u_{ie_i}) p_i = e_i(\lambda + h_i)$, so the average change per step is $\lambda + h_i$.

Now, if $M$ arms are currently in Stage 1 or 2, then the total value plus change in potential for these arms
is at least $M\lambda \ge OPT/2$. If fewer than $M$ arms are in those stages, then every arm $i$ that is not in Stage 1
or Stage 2 is in state $b$ and not "ready". Thus, its change in potential is $h_i$. Moreover, for every arm $j$ that is
in Stage 1 or Stage 2, we also get a contribution of at least $\lambda + h_j \ge h_j$. Summing, we get an expected value
plus change in potential of at least $\sum_{i \in S} h_i \ge OPT/2$, which completes the proof. □
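
The potential used in the proof can be written as a small function, which makes the two facts used above easy to check: an arm that is not ready gains $h_i$ per step, and the potential stops growing once the arm is ready. This is only an illustrative sketch; the function name and the numerical values are hypothetical.

```python
def potential(last_state, t, d, m, h, p):
    """Potential of an arm last observed in last_state ('g' or 'b') t steps ago."""
    if last_state == "g":
        return p
    # Builds up by h per step while the arm is "not ready", capped after d - m steps.
    return min(t, d - m) * h


# Hypothetical arm with d = 5, m = 2, h = 0.3, p = 1.0.
build_up = [potential("b", t, 5, 2, 0.3, 1.0) for t in range(1, 7)]
```

For this arm the potential grows for the first $d_i - m_i = 3$ steps and is constant afterwards, so it is bounded as the proof requires.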
8 Non-Preemptive Machine Replenishment

Finally, we show that our technique of balancing provides a 2-approximation for an unrelated, yet classic, restless
bandit problem [10, 23, 39]: modeling breakdown and repair of machines. Interestingly, we also show that
the Whittle index policy is an arbitrarily poor approximation to non-preemptive machine replenishment, and
thus the technique we suggest can be significantly stronger than the Whittle index policies.

There are n independent machines whose performance degrades with time in a Markovian fashion. This 
is modeled by transitions between states yielding decreasing rewards. At any step, any machine can be 
moved to a repair queue by paying a cost. The repair process is non-preemptive, Markovian, and can work 
on at most M machines per time step. A scheduling policy decides when to move a machine to a repair 
queue and which machine to repair at any time slot. The goal is to find a scheduling policy to maximize 
the time-average difference between rewards and repair cost. Note that if an arm is viewed as a machine, 
playing it corresponds to repairing it, and does not yield reward. In that sense, this problem is like an inverse 
of the Monotone bandits problem. We emphasize that the repairs are non-preemptive, which means that 
once a repair is started, it cannot be stopped. 

Formally, there are $n$ machines. Let $S_i$ denote the set of active states for machine $i$. If the state of
machine $i$ is $u \in S_i$ at the beginning of time $t$, the state evolves into $v \in S_i$ at time $t + 1$ w.p. $p_{uv}$.
The state transitions for different machines when they are active are independent. If the state of machine $i$
is $u \in S_i$ during a time step, it accrues reward $r_u \ge 0$. We assume each $S_i$ is poly-size.

During any time instant, machine $i$ in state $u \in S_i$ can be scheduled for maintenance by moving it to
the repair queue starting with the next time slot by paying cost $c_u$. The maintenance process for machine
$i$ takes time which is distributed as Geometric($s_i$), independent of the other machines. Therefore, if the
repair process works on machine $i$ at any time step, this repair completes after that time step with probability
$s_i$. During the time when the machine is in the repair queue, it yields no reward. When the machine is in
the repair queue, we denote its state by $\kappa_i$. The maintenance process is non-preemptive, and the server can
maintain at most $M$ machines at any time. When a repair completes, the machine $i$ returns to its "initial
active state" $\rho_i \in S_i$ at the beginning of the next time slot. The goal is to design a scheduling policy so that
the time-average reward minus maintenance cost is maximized.

In related work, Munagala and Shi [39] showed using a novel queuing analysis that when the repair
process is preemptive, $M = 1$, $S_i = \{\rho_i, b_i\}$ for all machines $i$, and $r_{b_i} = 0$, so that the machine
is either "active" (state $\rho_i$) or "broken" (state $b_i$), a simple greedy policy that is equivalent to the Whittle
index policy is a 1.51-approximation. However, as we show later, the Whittle index policy can be arbitrarily
bad for non-preemptive repairs since it computes indices for each machine separately. We now show that our
duality-based technique yields a 2-approximation policy with general $S_i$, $M$, and non-preemptive repairs.

8.1 LP Formulation and Dual 

We now present an LP bound on the optimal policy. For any policy, let $x_u$ denote the steady state probability
that machine $i$ is in state $u$ during a time step, and let $z_u$ denote the steady state probability that machine $i$
transitions from state $u \in S_i$ to state $\kappa_i$. We assume the policy moves a machine to the repair queue at the
beginning of a time slot, and that repairs finish at the end of a time slot. Note that it does not make sense to
repair a machine in its initial state $\rho_i$, so $z_{\rho_i} = 0$.

Maximize $\sum_{i} \sum_{u \in S_i} (r_u x_u \ldots$

$z_u, x_u \ge 0 \quad \forall i, u \in S_i \cup \{\kappa_i\}$

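The dual of the above LP assigns potentials $\phi_u$ for each state $u \in S_i$. Further, it assigns a value $h_i \ge 0$
for each machine $i$, and a global variable $\lambda \ge 0$. We directly write the balanced dual:

Minimize $M\lambda + \sum_i h_i$

$\lambda + h_i \ge s_i \phi_{\rho_i} \quad \forall i$

$h_i \ge r_u + \sum_{v \in S_i} p_{uv} (\phi_v - \phi_u) \quad \forall i, u \in S_i$

$\phi_u + c_u \ge 0 \quad \forall i, u \in S_i$

$M\lambda = \sum_i h_i$

$\lambda, h_i \ge 0 \quad \forall i$

Note that $M\lambda = \sum_i h_i \ge OPT/2$. We omit explicitly writing the corresponding primal formulation.
Now, focus only on machines for which $h_i > 0$. We have the following complementary slackness conditions:

1. $h_i > 0 \Rightarrow x_{\kappa_i} + \sum_{u \in S_i} x_u = 1$.

2. $x_u > 0 \Rightarrow h_i = r_u + \sum_{v \in S_i} p_{uv} (\phi_v - \phi_u)$.

3. $z_u > 0 \Rightarrow \phi_u + c_u = 0$.

4. $x_{\kappa_i} > 0 \Rightarrow \lambda + h_i = s_i \phi_{\rho_i}$.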
8.2 Index Policy and Analysis 



Scheduling: Consider only machines $i$ with $h_i > 0$.
    If the machine is in state $u$ where $z_u > 0$, then move the machine to the repair queue.

Repair: Service any subset $S_w$ of at most $M$ machines in the repair queue non-preemptively.
    W.p. $s_i$, the service for $i \in S_w$ completes and it moves to state $\rho_i$.

Figure 15: The repair policy from our LP-duality approach.
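
One time step of the policy in Figure 15 can be sketched as follows. The sketch covers only the scheduling and repair rules and omits the Markovian transitions among active states; the dictionary-based representation, the first-come-first-served choice of which waiting machines to service (the policy allows any subset), and all names are assumptions made for illustration.

```python
import random


def policy_step(states, z, in_service, waiting, M, s, rho, rng):
    """Apply one step of the repair policy sketch; mutates its arguments in place."""
    # Scheduling: an active machine in a state u with z_u > 0 joins the repair queue.
    for i, u in states.items():
        if u != "repair" and z[i].get(u, 0.0) > 0:
            states[i] = "repair"
            waiting.append(i)
    # Repair: fill free service slots; a repair that has started is never interrupted.
    while waiting and len(in_service) < M:
        in_service.append(waiting.pop(0))
    # Each serviced machine completes w.p. s_i and returns to its initial state rho_i.
    for i in list(in_service):
        if rng.random() < s[i]:
            in_service.remove(i)
            states[i] = rho[i]


# Hypothetical instance: two broken machines, one repair slot, repairs always succeed.
states = {1: "broken", 2: "broken"}
z = {1: {"broken": 0.2}, 2: {"broken": 0.3}}
in_service, waiting = [], []
rng = random.Random(0)
policy_step(states, z, in_service, waiting, 1, {1: 1.0, 2: 1.0}, {1: "ok", 2: "ok"}, rng)
after_first = dict(states)
policy_step(states, z, in_service, waiting, 1, {1: 1.0, 2: 1.0}, {1: "ok", 2: "ok"}, rng)
```

With $M = 1$, machine 2 has to wait in the queue while machine 1 is serviced, which is exactly the blocking effect that the non-preemptive analysis must account for.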



Claim 8.1. Consider only machines with $h_i > 0$. There are two cases:

1. For machines in which $z_v > 0$ for some $v$, we have $x_{\rho_i} > 0$, so that the policy can only reach states
$u \in S_i$ in which $x_u + z_u > 0$.

2. For machines in which $z_v = 0$ for all $v$, we have $x_{\kappa_i} = 0$. The policy will never repair the machine,
and after a finite number of steps, the machine will only visit states $u \in S_i$ for which $x_u > 0$.

Proof. Adding the third and fourth constraints of the primal yields $s_i x_{\kappa_i} = \sum_{u \in S_i} z_u$. If for some $v$, $z_v > 0$,
then $x_{\kappa_i} > 0$, which by the fourth constraint in the primal implies that $x_{\rho_i} > 0$. Now, suppose that $x_v > 0$;
then for every state $u$ such that $p_{vu} > 0$, the third constraint in the primal implies that $z_u + x_u > 0$. If
$z_u > 0$, then the policy will stop at state $u$ and enter machine $i$ into the repair queue. If $z_u = 0$, then it must
be that $x_u > 0$. Repeatedly using the above argument starting at $v = \rho_i$, we see that the policy will only
visit states with $x_u + z_u > 0$, not going beyond the first state where $z_u > 0$.

For machines in which $z_v = 0$ for all $v$, conditions (3) and (4) in the primal imply that $\{x_u\}$ are the
steady state probabilities of a Markov chain with transition matrix $[p_{uv}]$. Therefore, after a finite number of
steps, the machine will only go to states $u \in S_i$ for which $x_u > 0$. □



Theorem 8.2. The policy in Fig. 15 is a 2-approximation for non-preemptive machine replenishment. 



Proof. We next interpret $\phi_u$ as the potential for state $u \in S_i$. Let the potential for state $\kappa_i$ be 0. We show
that in each step, the expected reward plus change in potential is at least $OPT/2$.

First, when any active machine $i$ enters a state $u$ with $z_u > 0$, then the machine is moved to the repair
queue by paying cost $c_u$. The potential change is $-\phi_u$, and the sum of the cost and potential change is
$-c_u - \phi_u = 0$. The last equality is by complementary slackness. Therefore, moving a machine to the repair queue
does not alter the sum of reward and potential.

Next, let $S_r$ denote the set of machines in the repair queue, and let $S_w \subseteq S_r$ denote the subset of these
machines being repaired at the current time. Note that if $|S_r| \le M$, then $S_w = S_r$, otherwise $|S_w| = M$.
For each machine $i \in S_w$, the repair finishes w.p. $s_i$, and the machine's potential changes by $\phi_{\rho_i}$. Therefore,
the expected change in potential per step is $s_i \phi_{\rho_i} = \lambda + h_i$ by complementary slackness.

Suppose first that $|S_w| = M$; then the net reward plus change in potential is at least $\sum_{i \in S_w} (\lambda + h_i) \ge
M\lambda \ge OPT/2$. Suppose that $|S_w| < M$; then we must have $S_w = S_r$. Note that any machine that enters a
state $u$ with $z_u > 0$ will be automatically moved to $S_r$ at the beginning of the time step. Using this along
with Claim 8.1, we have that after a finite number of steps, any machine $i \notin S_r$ enters a state $u$ with $x_u > 0$.
(Since we care about infinite horizon average reward, the finite number of steps does not matter.) The reward
plus change in potential for machine $i \notin S_r$ is $r_u + \sum_{v \in S_i} p_{uv} (\phi_v - \phi_u) = h_i$ by complementary slackness.
Therefore, the total reward plus change in potential is $\sum_{i \in S_r} (\lambda + h_i) + \sum_{i \notin S_r} h_i \ge \sum_i h_i \ge OPT/2$.
Since the potentials are bounded, the policy is a 2-approximation. □



8.3 Gap of the Whittle Index 

We now show that the Whittle index policy is an arbitrarily poor approximation to non-preemptive machine
replenishment. Note that in the instance shown below, the Whittle index policy is a 1.51-approximation when repairs
can be preempted [39]. However, when no preemption is allowed, the policy can perform arbitrarily poorly.

Theorem 8.3. The Whittle index policy is an arbitrarily poor approximation for non-preemptive machine 
replenishment even with 2 machines and M = 1 repairs per step. 

Proof. Suppose $M = 1$, there are two machines $\{1, 2\}$, and $S_i = \{\rho_i, b_i\}$ for machines $i \in \{1, 2\}$. Let
$r_{\rho_i} = r_i$ and let $r_{b_i} = 0$, so that the machine is either "active" (state $\rho_i$) or "broken" (state $b_i$). Let $p_i$ denote
the probability of transitioning from state $\rho_i$ to $b_i$. Assume $c_i = 0$. Note that playing a machine corresponds
to moving it to the repair queue.

The Whittle index of a state is the largest penalty that can be charged per maintenance step so that the
optimal single machine policy will still schedule the machine for maintenance on entering the current state.
In the 2-state machines mentioned above, the Whittle index in state $\rho_i$ is negative, since even with penalty zero
per repair step, the policy will not schedule the machine for maintenance in the good state. The Whittle
index for state $b_i$ is $\eta_i = s_i r_i / p_i$, since for this value of penalty, the expected reward of $r_i / p_i$ per renewal is
the same as the expected penalty of $\eta_i / s_i$ paid for maintenance in the renewal period.

Suppose si = 1/n'^, S2 = 1, Pi = ^/n, p2 = 1, n = r2 = 1. If used by itself, machine 1 yields 
reward rij^^ 1/n^ and machine 2 yields reward r2j^q^ = 1. Any reasonable policy will therefore 
only maintain machine 2 and ignore machine 1. However, in the Whittle index policy, when machine 1 is 
broken and machine 2 is active, the policy decides to maintain machine 1 (since the Whittle index, $\eta_1$, of $b_1$
is positive and that of p2 is negative). In this case, machine 1 is scheduled for repair. This repair takes 0{n'^) 
time steps and cannot be interrupted. Moreover, since machine 2 is bad at least half the time, this "blocking" 
by machine 1 will happen with rate 0(1/2), so in the long run, machine 2 is almost always broken and the 
Whittle index policy obtains reward O ( 1 /n'^ ) , while the optimal policy obtains reward r2 = 1 /2 by only 
maintaining machine 2. □ 
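
The index formula $\eta_i = s_i r_i / p_i$ from the proof is easy to evaluate, and doing so shows why the index alone cannot detect a machine that is slow to repair. The sketch below also computes a renewal-reward estimate of the reward a two-state machine earns when it is the only machine being maintained (expected active time $1/p_i$, expected repair time $1/s_i$); that estimate, the function names, and the parameters of the "slow" machine are assumptions made for illustration and are not the parameters of the instance in the proof.

```python
def whittle_index_broken(r, p, s):
    """Whittle index of the broken state of a two-state machine: eta = s * r / p."""
    return s * r / p


def standalone_reward(r, p, s):
    """Renewal-reward estimate: reward r/p per cycle of expected length 1/p + 1/s."""
    return (r / p) / (1.0 / p + 1.0 / s)


# Machine with s = p = r = 1, as for machine 2 in the proof.
fast_index = whittle_index_broken(1.0, 1.0, 1.0)
fast_reward = standalone_reward(1.0, 1.0, 1.0)

# Hypothetical slow-to-repair machine.
slow_index = whittle_index_broken(1.0, 0.1, 0.001)
slow_reward = standalone_reward(1.0, 0.1, 0.001)
```

The slow machine has a positive index in its broken state even though its standalone reward is tiny, so an index rule that compares it against the negative index of an active machine will commit the single repair slot to it.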



9 Open Questions 

Our work throws open interesting research avenues. First, can our algorithmic techniques be extended to
other subclasses of restless bandits, for instance, the POMDP problem obtained by generalizing FEEDBACK
MAB to $K > 2$ states per arm? Note that unlike the $K = 2$ case considered here, the transition probability
values are no longer monotone as they are based on an underlying Markov chain. Next, can matching hardness
results be shown for these problems, particularly FEEDBACK MAB? Finally, our analysis effectively
uses piece-wise linear Lyapunov functions. Such functions derived from LP relaxations have also been used
by Bertsimas, Gamarnik, and Tsitsiklis [11] to show stability in multi-class queuing systems. Though the
techniques and results in that work are very different from ours, it would be interesting to explore whether
our techniques extend to multi-class queuing problems.
A Omitted Proofs

The following proofs are deferred here because they are independent of our duality-based technique and,
given their length, may detract from the paper's flow.

A.1 Proof of Theorem O



We show examples in which the myopic policy and the optimal index policy exhibit the desired gaps against 
the optimum. 

Gap of the Myopic Policy. We now show an instance where the myopic policy that plays the arm with
the highest expected next-step reward has gap $\Omega(n)$ with respect to the reward of the optimal policy. The
instance is an extension of the instance constructed above. There is one "type 1" deterministic arm with
reward ri = 1. There are n i.i.d. "type 2" arms with r2 = n, (3 = i^, and = ^. 

First consider the myopic policy. Any policy encounters an instant where all the type 2 arms are in state 
h. In this case, the myopic next step reward of any of these arms is at most ?"2;5^ < 1, so that the policy 
always plays the type 1 arm, yielding long-term reward of 1. 

Next consider the myopic policy that ignores the type 1 arm. Such a policy performs round-robin on the 
arms when it observes all of them to be in state b. In this case, the probability that the arm it plays will be 
in state g is at least Vn > Therefore, the behavior of this policy is dominated by the following 2-state 
Markov chain: The two states are $h$ and $l$; state $h$ yields reward $n$, and state $l$ yields reward 0. The transition
probabilities from h to I and vice-versa are i^. The long-term reward is therefore at least ^, which lower- 
bounds the reward of the optimal policy. 

Non-optimality of Index policies. We will now show an instance where there is a constant factor gap
between the optimal policy and the optimal index policy. The example has 3 arms. Arm 1 is deterministic
with reward $r_1 = 1$. Arms 2 and 3 are i.i.d. with $\alpha = \beta = 0.1$ and reward in state $g$ being $r_2 = 2$.

We compute the optimal policy by value iteration [10] using a discount factor of $\gamma = 0.99$ (to ensure
the dynamic program converges). The optimal policy always plays arm 2 or 3 if either was just observed in
state $g$. The decisions are complicated only if both arms 2, 3 were last observed in state $b$. In this case, we
can compactly represent the current state by the pair $(k_1, k_2) \in Z^+ \times Z^+$, representing the number of time steps ago
that arms 2, 3 were observed in state $b$ respectively. For such a state, the policy either plays arm 1, or plays
arm 2 or 3 depending on whether $k_1 > k_2$ or not. Such a policy is therefore completely characterized by
the region $V$ on the $(k_1, k_2)$ plane where the decision is to play arm 1; in the remaining region, it plays arm
2 or 3 depending on whether $k_1 > k_2$ or not. For the optimal policy, we have:

$V = \{(k_1, k_2) \in Z^+ \times Z^+ \mid k_1 \le 4, k_2 \le 4, k_1 + k_2 \le 6\}$
In other words, the description of the optimal policy is as follows (note that it is symmetric w.r.t. arms 2,3): 



State $(k_1, k_2)$            Play Arm
$k_1 \le 3$, $k_2 < k_1$       1
$k_1 = 4$, $k_2 \le 2$         1
$k_1 = 4$, $k_2 = 3$           2
$k_1 = 4$, $k_2 > 4$           3
$k_1 \ge 5$, $k_2 < k_1$       2

Table 1: Description of the optimal policy when both arms 2, 3 were last observed to be $b$. Note that the
policy is symmetric w.r.t. arms 2, 3, and furthermore, $k_1 \ne k_2$.
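
The decision rule described by the region $V$ and Table 1 can be written as a short function, which makes the non-index behavior easy to reproduce. This is only a sketch of the rule as stated in the text; the function names are hypothetical, and the arguments are assumed to satisfy $k_1 \ne k_2$.

```python
def in_region_v(k1, k2):
    """Region of the (k1, k2) plane in which the optimal policy plays arm 1."""
    return k1 <= 4 and k2 <= 4 and k1 + k2 <= 6


def arm_to_play(k1, k2):
    """k1, k2: steps since arms 2 and 3 were last observed in state b (k1 != k2)."""
    if in_region_v(k1, k2):
        return 1
    # Outside the region, play the arm that was observed in state b longer ago.
    return 2 if k1 > k2 else 3


decisions = {(k1, k2): arm_to_play(k1, k2) for (k1, k2) in [(4, 2), (4, 3), (4, 5), (5, 2)]}
```

With arm 2 fixed at $k_1 = 4$, the decision changes from arm 1 to arm 2 as $k_2$ goes from 2 to 3, so the choice between arms 1 and 2 depends on the state of arm 3.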

Note the following non-index behavior where given the state of arms 1 and 2, the decision to play
switches between these arms depending on the state of arm 3. If arm 2 was observed to be $b$ some 4 steps
ago, then: (i) if arm 3 was $b$ some 2 steps ago, the policy plays arm 1; (ii) if arm 3 was $b$ some 3 steps ago,
the policy plays arm 2. To compute the reward of this policy, we observe that it has an equivalent description
as a Markov chain over 6 states (these new states correspond to groups of states in the original process). A
closed form evaluation of this chain shows that the reward of the optimal policy is 1.46218.

We next perform this evaluation for the nearby index policies. Note that for any index policy, the region
$V$ has to be an axis-parallel square. The first is where the decision for $k_1 = 4, k_2 \le 2$ is to play arm 2, so
that $V = \{(k_1, k_2) \mid k_1, k_2 \le 3\}$. This policy evaluates to an average reward of 1.46104. The next is where
the decision for $k_1 = 4, k_2 = 3$ is to play arm 1, so that $V = \{(k_1, k_2) \mid k_1, k_2 \le 4\}$. This policy has
reward 1.46167. Other index policies have only worse reward. This implies that there is a constant factor
gap between the optimal policy and the best index policy.




Figure 16: Behavior of the function $F(\lambda, t)$.



A.2 Proof of Theorem |221 

Since the proof focuses on a single arm $i$, we omit the subscript for the arm. For notational convenience,
denote $t^* = t(\lambda)$. The expression for $H(\lambda)$ follows easily from Lemma 2.5. Recall from Definition 5 that
$F(\lambda, t) = R(t) - \lambda Q(t)$ is the value of policy $\mathcal{P}(t)$.



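Case 1: $\lambda \ge r \left( \frac{\alpha}{\alpha + \beta(\alpha + \beta)} \right)$. Consider the subcase $\lambda \ge r$. The function $F(\lambda, t)$ is maximized by driving
the expression (which is always non-positive) to zero. This happens when $t = \infty$. Otherwise, when $r > \lambda$,
observe (using the upper bound $v_t \le \frac{\alpha}{\alpha + \beta}$) that

$F(\lambda, t) = \frac{(r - \lambda) v_t - \lambda\beta}{v_t + t\beta} \le \frac{(r - \lambda) \frac{\alpha}{\alpha + \beta} - \lambda\beta}{v_t + t\beta}$

The above is now non-positive, and it follows again that $t = \infty$ is the optimum solution.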



Case 2: In this case let A = r j — p for some p > 0. Rewrite the above expression as 

Define the following quantities (independent of t): 

Q 1 Q 

i/ = (l-a-/3) r]= — —5 log- (/> = f?AH r^(^--^) 

p, = r]{r-X) uj = Xp- a — — = -p — 

a + p a + p 

Observe r — X > p. Note that (/), p > 0. By assumption, the value v G {6, 1] has polynomial bit 
complexity. The same holds for rj, cj), p and p. Relaxing t to be a real, observe: 
Since the denominator of $dF/dt$ is always non-negative, the value of $t^*$ is either $t^* = 1$, or the point
where the sign of the numerator $g(t) = (\phi + \mu t)\nu^t + \omega$ changes from $+$ to $-$. We observe that $g(t)$ has a
unique maximum at $t_3 = \frac{1}{\log(1/\nu)} - \frac{\phi}{\mu}$. If $g(t_3)$ is negative then the numerator of $\partial F(\lambda, t)/\partial t$ is always negative
and the optimum solution is at $t^* = 1$.

If $g(t_3)$ is positive, then it cannot change sign from $+$ to $-$ in the range $[1, t_3)$ since it has a unique
maximum. Therefore in this range $t = 1$, $t = \lfloor t_3 \rfloor$, or $t = \lceil t_3 \rceil$ are the optimum solutions.

But for $t \ge t_3$, since $g(t)$ is decreasing, $dF/dt$ changes sign once from $+$ to $-$ as $t$ increases, and $F(\lambda, t) \to 0$
as $t \to \infty$. This behavior is illustrated in Figure 16. Therefore, we find a $t_4 > t_3$ such that $g(t_4) < 0$, and
perform binary search in the range $[t_3, t_4]$ to find the point where $F$ is maximized. It is easy to compute
$t_4$ with polynomial bit complexity in the complexities of $\eta$, $\phi$, $\mu$ and $\rho$. We finally compare this maximal
value of $F$ to the values of $F$ at $1$, $\lfloor t_3 \rfloor$, $\lceil t_3 \rceil$. Thus we can solve $H(\lambda)$ and obtain $t^*$ in polytime.
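
The search described above can be sketched as follows: locate the stationary point of $g(t) = (\phi + \mu t)\nu^t + \omega$, find a later point where $g$ is negative, and bisect in between. The sketch works over the reals and stops at the sign change of $g$; comparing $F$ at the neighboring integers and at $t = 1$ is left out. It assumes $0 < \nu < 1$, $\mu > 0$ and $\omega < 0$, and the function names and sample parameter values are hypothetical.

```python
import math


def g(t, phi, mu, nu, omega):
    """Numerator of dF/dt in the form (phi + mu * t) * nu**t + omega."""
    return (phi + mu * t) * nu ** t + omega


def sign_change_point(phi, mu, nu, omega, tol=1e-9):
    """Point t >= max(1, t3) where g changes sign from + to -, or None if there is none."""
    t3 = 1.0 / math.log(1.0 / nu) - phi / mu  # stationary point of g
    lo = max(1.0, t3)
    if g(lo, phi, mu, nu, omega) <= 0:
        return None  # g is already non-positive on its decreasing branch
    hi = 2.0 * lo  # candidate for t4; doubled until g(hi) < 0
    while g(hi, phi, mu, nu, omega) >= 0:
        hi *= 2.0
    # g is decreasing on [lo, hi], so bisection isolates the unique sign change.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid, phi, mu, nu, omega) > 0:
            lo = mid
        else:
            hi = mid
    return lo


root = sign_change_point(phi=1.0, mu=1.0, nu=0.5, omega=-0.25)
```

For these sample values $g(4) > 0 > g(5)$, so the returned point lies between 4 and 5 and the integer candidates for $t^*$ are 4 and 5 (together with $t = 1$).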

A.3 Proof of Theorem 2.12

We first show the structure of the optimal solution to (Whittle). Using the notation from Definition 3, we
have: $H_i(\lambda) = R_i(\lambda) - \lambda Q_i(\lambda)$. Let $R(\lambda) = \sum_{i=1}^{n} R_i(\lambda)$ and $Q(\lambda) = \sum_{i=1}^{n} Q_i(\lambda)$. The following lemma
shows that the optimal solution to (Whittle) is obtained by choosing $\lambda$ such that $Q(\lambda) \approx 1$.

Lemma A.1. The optimal solution to Whittle's LP chooses a penalty $\lambda^*$ and a fraction $a \in [0, 1]$, so that
$a Q(\lambda^*_-) + (1 - a) Q(\lambda^*_+) = 1$. Here, $\lambda^*_- < \lambda^* < \lambda^*_+$ with $|\lambda^*_+ - \lambda^*_-| \to 0$. The solution corresponds to a
convex combination of $\mathcal{P}_i(t_i(\lambda^*_-))$ with weight $a$ and $\mathcal{P}_i(t_i(\lambda^*_+))$ with weight $1 - a$ for each arm $i$.

Proof. For the optimal solution to (Whittle), recall that OPT denotes the expected reward. The expected
rate of playing the arms is exactly 1 by the LP constraint.

When $\lambda = 0$, then $t_i(\lambda) = 1$ for all $i$, implying $Q(\lambda) = n$. Similarly, when $\lambda = \lambda_{\max} \ge \max_i r_i$,
$t_i(\lambda) = \infty$ for all $i$, so that $Q(\lambda) = 0$. Therefore, as $\lambda$ is increased from 0 to $\lambda_{\max}$, there is a transition
value $\lambda^*$ such that $Q(\lambda^*_-) = Q_1 \ge 1$, and $Q(\lambda^*_+) = Q_2 < 1$; furthermore, $|\lambda^*_+ - \lambda^*_-| \to 0$.

Since the solution to (Whittle) is feasible for LPLagrange($\lambda$), we must have:

$R(\lambda^*_+) - \lambda^* Q_2 \ge OPT - \lambda^* \qquad R(\lambda^*_-) - \lambda^* Q_1 \ge OPT - \lambda^*$

Let $a = \frac{1 - Q_2}{Q_1 - Q_2}$; then taking the convex combination of the above inequalities, we obtain:

$a R(\lambda^*_-) + (1 - a) R(\lambda^*_+) \ge OPT$

$a Q(\lambda^*_-) + (1 - a) Q(\lambda^*_+) = 1$

This completes the proof. □
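
The mixing weight in the proof is determined by the two rates of play on either side of the transition. A minimal sketch, with a hypothetical function name and sample rates:

```python
def mixing_weight(q1, q2):
    """Weight a with a * q1 + (1 - a) * q2 = 1, for rates of play q1 >= 1 > q2."""
    if not (q1 >= 1.0 > q2):
        raise ValueError("expected q1 >= 1 > q2")
    return (1.0 - q2) / (q1 - q2)


# Hypothetical rates of play just below and just above the transition penalty.
a = mixing_weight(1.5, 0.5)
combined_rate = a * 1.5 + (1.0 - a) * 0.5
```

Here the weight is $a = 0.5$ and the combined rate of play is exactly 1, as the LP constraint requires.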



To prove Theorem 2.12, consider $n$ i.i.d. arms with $n\beta \ll 1$, $\alpha = \beta/(n - 1)$ and $r = 1$. Each arm is in
state $g$ w.p. $1/n$, so that all arms are in state $b$ w.p. $1/e$ and the maximum possible reward of any feasible
policy is $1 - 1/e$ even with complete information about the states of all arms.

We will show that Whittle's LP has value $1 - O(\sqrt{n\beta})$ for $n\beta \ll 1$. Since the LP is symmetric w.r.t. the
arms, it is easy to show (from Lemma A.1) that for each arm, it will construct the same convex combination
of two single-arm policies. The first policy is of the form $\mathcal{P}(t - 1)$, and the second is of the form $\mathcal{P}(t)$. The
constraint is that if these policies are executed independently, exactly one arm is played in expectation per
step. Since $\mathcal{P}(t)$ has lower average reward and rate of play than $\mathcal{P}(t - 1)$, we consider the sub-optimal LP
solution that uses policy $\mathcal{P}(t)$ for each arm.

The policy $\mathcal{P}(t)$ always plays in state $g$, and in state $b$, waits $t$ steps before playing. The value $t$ is chosen
so that the rate of play for each arm is less than $1/n$, and $\mathcal{P}(t - 1)$ has a rate of play larger than $1/n$. The
rate of play of the single arm policy $\mathcal{P}(t)$ is given by the formula: $Q(t) = \frac{v_t + \beta}{v_t + t\beta}$. Since this is $1/n$, we have
$v_t = \beta(t - n)/(n - 1)$. The reward of each arm is $R(t) = \frac{v_t}{v_t + t\beta} = \frac{t - n}{n(t - 1)}$, so that the objective of Whittle's
LP is $nR(t) = 1 - \Theta(n/t)$.

Now, from $v_t = \beta(t - n)/(n - 1)$, we obtain $1 - (1 - \beta')^t = \beta'(t - n)$, where $\beta' = \alpha + \beta = \frac{\beta n}{n - 1}$.
This holds for $t = \Theta(\sqrt{n/\beta})$ provided $n\beta \ll 1$. Plugging this value of $t$ into the value $nR(t)$ of Whittle's
LP completes the proof of Theorem 2.12.
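
The scaling $t = \Theta(\sqrt{n/\beta})$ can be checked numerically on a small instance. The sketch below assumes the closed form $v_t = \frac{\alpha}{\alpha + \beta}(1 - (1 - \alpha - \beta)^t)$ for the two-state chain together with the expressions for $Q(t)$ and $R(t)$ used above; the function names and the instance ($n = 10$, $\beta = 10^{-4}$) are hypothetical.

```python
import math


def v(t, alpha, beta):
    """P(state g after t steps | last observed b); assumed two-state closed form."""
    return alpha / (alpha + beta) * (1.0 - (1.0 - alpha - beta) ** t)


def rate_of_play(t, alpha, beta):
    """Q(t) = (v_t + beta) / (v_t + t * beta)."""
    vt = v(t, alpha, beta)
    return (vt + beta) / (vt + t * beta)


def reward_rate(t, alpha, beta, r=1.0):
    """R(t) = r * v_t / (v_t + t * beta)."""
    vt = v(t, alpha, beta)
    return r * vt / (vt + t * beta)


def waiting_time(n, beta):
    """Smallest t whose rate of play is at most 1/n when alpha = beta / (n - 1)."""
    alpha = beta / (n - 1)
    t = 1
    while rate_of_play(t, alpha, beta) > 1.0 / n:
        t += 1
    return t


n, beta = 10, 1e-4
alpha = beta / (n - 1)
t = waiting_time(n, beta)
lp_value = n * reward_rate(t, alpha, beta)
scale = math.sqrt(2.0 * n / (alpha + beta))  # second-order estimate of t
```

On this instance the waiting time is within a small constant factor of $\sqrt{n/\beta}$ and the resulting LP value $nR(t)$ is close to 1, well above the $1 - 1/e$ bound on any feasible policy.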



A.4 Proof of Lemma 3.1

Recall the notation from Section 2.2 and Definition 3. We first present the following structural lemma
about the optimal single-arm policy $L_i(\lambda)$. Suppose this policy is of the form $\mathcal{P}_i(t_i(\lambda))$, where $t_i(\lambda) =
\mathrm{argmax}_{t \ge 1} F_i(\lambda, t)$.

Lemma A.2. $t_i(\lambda)$ is monotonically non-decreasing in $\lambda$.

Proof. We have: $\frac{\partial F_i(\lambda, t)}{\partial \lambda} = -Q_i(t) = -\frac{v_{it} + \beta_i}{v_{it} + t\beta_i}$. Since $Q_i(t)$ is a decreasing function of $t$, the above is an
increasing function of $t$ and always negative, which implies that for smaller $t$, the function $F_i(\lambda, t)$ decreases
faster as $\lambda$ is increased. This implies that if $t_i(\lambda) = \mathrm{argmax}_{t \ge 1} F_i(\lambda, t)$, then for $\lambda' \ge \lambda$, the maximum of
$F_i(\lambda', t)$ is attained for some $t_i(\lambda') \ge t_i(\lambda)$. □
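
The monotonicity in Lemma A.2 can be observed numerically by maximizing $F(\lambda, t) = R(t) - \lambda Q(t)$ over a finite range of $t$ for increasing penalties. The sketch assumes the closed form of $v_t$ for a two-state chain and truncates the search at a finite `horizon` (so "$t = \infty$" shows up as the horizon itself); the function names, the horizon and the penalty grid are hypothetical, while $r = 2$ and $\alpha = \beta = 0.1$ match the arms used in Appendix A.1.

```python
def v(t, alpha, beta):
    """P(state g after t steps | last observed b); assumed two-state closed form."""
    return alpha / (alpha + beta) * (1.0 - (1.0 - alpha - beta) ** t)


def F(lam, t, r, alpha, beta):
    """F(lambda, t) = R(t) - lambda * Q(t) = ((r - lambda) v_t - lambda beta) / (v_t + t beta)."""
    vt = v(t, alpha, beta)
    return ((r - lam) * vt - lam * beta) / (vt + t * beta)


def best_wait(lam, r, alpha, beta, horizon):
    """Smallest maximizer of F(lambda, .) over t = 1, ..., horizon."""
    return max(range(1, horizon + 1), key=lambda t: F(lam, t, r, alpha, beta))


penalties = (0.0, 0.5, 1.0, 1.5, 1.9)
waits = [best_wait(lam, 2.0, 0.1, 0.1, 200) for lam in penalties]
```

The maximizer starts at $t = 1$ for $\lambda = 0$, grows with the penalty, and reaches the horizon once $\lambda$ is large enough that never playing in state $b$ is optimal.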

Now note that when $\lambda = 0$, there is no penalty, so that the single-arm policy maximizes its reward by
playing every step regardless of the state. Therefore, $\Pi_i(s, t) \ge 0$ for all states $(s, t)$.*

Suppose the arm is in state $(g, 1)$. The immediate expected reward if played is $r_i(1 - \beta_i)$. If the penalty
$\lambda < r_i(1 - \beta_i)$, a policy that plays the arm and stops later has positive expected reward minus penalty.
Therefore, for penalty $\lambda$, the optimal decision at state $(g, 1)$ is "play", so that $\Pi_i(g, 1) \ge r_i(1 - \beta_i)$. We
now show that $\Pi_i(g, 1) = r_i(1 - \beta_i)$. Suppose the penalty is $\lambda > r_i(1 - \beta_i)$. If played in state $(g, 1)$, the
immediate expected reward minus penalty is negative, and leads to the policy being in state $(g, 1)$ or $(b, 1)$.
The best possible total reward minus penalty in the future is obtained by always playing in state $(g, 1)$ and
waiting as long as possible in state $(b, 1)$ (since this maximizes the chance of going to state $g$ if played).
Whenever the arm is played in state $b$ after $w$ steps, the probability of observing state $g$ is at most $\frac{\alpha_i}{\alpha_i + \beta_i}$.
Consider two consecutive events of the policy when the last play was in state $(g, 1)$ and the current observed
state is $(b, 1)$. Since the optimal such policy is ergodic, this interval would define a renewal period. In this
period, the expected penalty is at least $\lambda \left( \frac{1}{\beta_i} + \frac{\alpha_i + \beta_i}{\alpha_i} \right)$, and the expected reward is $\frac{r_i}{\beta_i}$. Therefore, the net
expected reward minus penalty in the renewal period is:

$\frac{r_i}{\beta_i} - \lambda \left( \frac{1}{\beta_i} + \frac{\alpha_i + \beta_i}{\alpha_i} \right) \le r_i \left( \frac{1}{\beta_i} - (1 - \beta_i) \left( \frac{1}{\beta_i} + \frac{\alpha_i + \beta_i}{\alpha_i} \right) \right) = -\frac{r_i \beta_i}{\alpha_i} (1 - \beta_i - \alpha_i) < 0$

The last inequality follows since $\alpha_i + \beta_i \le 1 - \delta$ for a $\delta > 0$ specified as part of the input. This implies that
if $\lambda > r_i(1 - \beta_i)$, then any policy that plays in state $(g, 1)$ has negative net reward minus penalty, showing
that "not playing" is optimal. Therefore, $\Pi_i(g, 1) = r_i(1 - \beta_i)$.

Next assume that for penalty slightly less than $r_i(1 - \beta_i)$, the policy decision is to "play" in state $(b, t)$.
Consider the smallest such $t$. Since the policy also decides to play in $(g, 1)$, consider the renewal period
defined by two consecutive events where the last play was in state $(g, 1)$ and the current
observed state is $(b, 1)$. The reward is $\frac{r_i}{\beta_i}$ and the penalty is $\lambda \left( \frac{1}{\beta_i} + \frac{1}{v_{it}} \right)$. Since $\lambda \approx r_i(1 - \beta_i)$, and
$v_{it} \le \frac{\alpha_i}{\alpha_i + \beta_i}$, the above analysis shows that the net expected reward minus penalty is negative in the renewal
period. Therefore, the decision in $(b, t)$ is to "not play", so that $\Pi_i(b, t) \le r_i(1 - \beta_i)$.

Finally, for any $\lambda < r_i(1 - \beta_i)$, consider the smallest $t \ge 1$ so that the optimal decision in state $(b, t)$ is
to "play". If this is finite, the optimal policy for this $\lambda$ is precisely $L_i(\lambda) = \mathcal{P}_i(t_i(\lambda))$. From Lemma A.2, the
function $t_i(\lambda)$ is non-decreasing in $\lambda$. Therefore, for any state $(b, t^*)$, the quantity $\max\{\lambda \mid L_i(\lambda) = \mathcal{P}_i(t^*)\}$
is well-defined. For larger values of penalty $\lambda$, we have $t_i(\lambda) > t^*$, so that the decision in $(b, t^*)$ is "do
not play". Therefore, $\Pi_i(b, t) = \max\{\lambda \mid L_i(\lambda) = \mathcal{P}_i(t)\}$. Since $t_i(\lambda)$ is non-decreasing in $\lambda$, the function
$\Pi_i(b, t)$ is non-decreasing in $t$. This completes the proof of Lemma 3.1.

* Note that the claim $\Pi_i(s, t) \ge 0$ is true only for FEEDBACK MAB, where the underlying 2-state process evolves regardless of the plays; the claim
need not be true for MONOTONE bandits defined in Section 4, where even with penalty $\lambda = 0$, the arm may idle in certain states.